New Insight regarding Legionella Non-Pneumophila Species Identification: Comparison between the Traditional mip Gene Classification Scheme and a Newly Proposed Scheme Targeting the rpoB Gene

ABSTRACT The identification of Legionella non-pneumophila species (non-Lp) in clinical and environmental samples is based on the mip gene, although several studies suggest its limitations and the need to expand the classification scheme to include other genes. In this study, the development of a new classification scheme targeting the rpoB gene is proposed to obtain a more reliable identification of 135 Legionella environmental isolates. All isolates were sequenced for the mip and rpoB genes, and the results were compared to study the discriminatory power of the proposed rpoB scheme. Complete concordance between the mip and rpoB results based on genomic percent identity was found for 121/135 (89.6%) isolates; in contrast, discordance was found for 14/135 (10.4%) isolates. Additionally, due to the lack of reference values for the rpoB gene, inter- and intraspecies variation intervals were calculated based on a pairwise identity matrix that was built using the entire rpoB gene (∼4,107 bp) and a partial region (329 bp) to better evaluate the genomic identity obtained. The interspecies variation interval found here (4.9% to 26.7%) was then proposed as a useful sequence-based classification scheme for the identification of unknown non-Lp isolates. The results suggest that using both the mip and rpoB genes makes it possible to correctly discriminate between several species, allowing possible new species to be identified, as confirmed by preliminary whole-genome sequencing analyses performed on our isolates. Therefore, starting from a valid and reliable identification approach, the simultaneous use of mip and rpoB associated with other genes, as it occurs with the sequence-based typing (SBT) scheme developed for Legionella pneumophila, could support the development of multilocus sequence typing to improve the knowledge and discovery of Legionella species subtypes.

IMPORTANCE Legionella spp. are a widely spread bacteria that cause a fatal form of pneumonia. While traditional laboratory techniques have provided valuable systems for Legionella pneumophila identification, the amplification of the mip gene has been recognized as the only useful tool for Legionella non-pneumophila species identification both in clinical and environmental samples. Several studies focused on the mip gene classification scheme showed its limitations and the need to improve the classification scheme, including other genes. Our study provides significant advantages on Legionella identification, providing a reproducible new rpoB gene classification scheme that seems to be more accurate than mip gene sequencing, bringing out greater genetic variation on Legionella species. In addition, the combined use of both the mip and rpoB genes allowed us to identify presumed new Legionella species, improving epidemiological investigations and acquiring new understanding on Legionella fields.

and environmental Lp strains (24,25) and a database based on macrophage infectivity potentiator (mip) gene sequencing for non-Lp isolates (26,27). Currently, while for clinical and environmental Lp strains, a multilocus typing scheme has been developed by the EWGLI, represented by a SBT approach (24,25), regarding the non-Lp isolates, identification is still based only on mip gene sequencing (26,27), and no recognized and standardized typing approach was developed. Regarding the identification of Legionella species, several genetic markers have been proposed, including 16S rRNA, which was subsequently replaced by the mip gene, as this gene can overcome the limitations of intraspecies heterogenicity in the 16S rRNA gene (28). However, some species and some environmental isolates could not be confidently discriminated by the mip scheme, such as L. geestiana or European wild strain LC4381 (29).
Another gene that is widely used for bacterial identification is the rpoB gene. This gene encodes a subunit of DNA-dependent RNA polymerase, and mutations in its sequence are known to cause rifampicin resistance. rpoB DNAs comprise a highly conserved region throughout bacteria that may be used for bacterial classification (30). It can identify enteric bacteria, Mycobacterium, spirochetes, and Legionella species, including some causative agents of Legionnaires' disease (30,31). Regarding the identification of non-Lp, the nucleotide variation of rpoB is able to differentiate these species better than 16S rRNA and mip in some cases (31). The partial rpoB sequence (300 bp) can guarantee the genotypic classification of Lp and blue-white autofluorescent species (31). This region can distinctly differentiate species that share high similarities in their 16S rRNA gene sequences and that cannot be analyzed successfully by mip (26,31). Thus, rpoB analysis could clearly differentiate among Legionella spp.
Although rpoB has higher intraspecies variability, it is widely used for bacterial identification, and it is considered, in some cases, to be the best approach, such as for nontuberculous mycobacteria (NTM) and Acinetobacter. This marker is not sufficient for Legionella classification, especially for non-Lp, although different studies have already suggested combining rpoB with the mip gene to identify these species more accurately (32-34). In addition, in the scientific literature, there are reference values for mip gene analysis that can be used to determine the inter- and intraspecies nucleotide variation; however, for rpoB, there are no works that establish reference intervals (26), and this limits the application of the rpoB gene as a marker in the classification scheme.
In the present study, 135 Legionella spp. strains recovered from environmental communities were analyzed for rpoB gene sequencing, and the results obtained were compared with a mip gene sequencing identification scheme to study the discriminatory power of rpoB sequences and establish an inter- and intraspecies variation interval to improve the use of the rpoB gene as a target for non-Lp identification.

RESULTS
All 135 isolates showed positive growth on BCYE cys+ and negative growth on BCYE cys− and on tryptone soy agar (TSA) with 5% sheep blood. Moreover, the Legionella latex agglutination test displayed positive results for 34/135 isolates (25.2%) and ambiguous results for 10/135 (7.4%) isolates; in contrast, most isolates (91/135 [67.4%]) showed negative results for the agglutination test. All of them were then submitted for gene amplification as previously described.
The results obtained by mip and rpoB gene sequencing and their ranges of matches compared to the reference strains are shown in Table 1, where the 14 isolates with a discrepancy in the nucleotide identity percentage for mip, rpoB, or both genes are highlighted in bold. Regarding mip gene identification, compared with the respective reference strains, our isolates showed a nucleotide identity interval of 98.2% to 100%, with the exception of two L. anisa isolates and one L. quinlivanii isolate with nucleotide identities of 96.7% and 96.2%, respectively. However, the rpoB gene results showed a nucleotide identity interval of 95.1% to 100%, except for two isolates of L. anisa and four isolates of L. quateirensis, which were identical to each other with nucleotide identity percentages of 92.4% and 94.5%, respectively. Moreover, for the seven isolates identified by the mip gene as L. feeleii (98.2%), a discrepancy with the rpoB gene identity results was found, showing a percentage of identity of 95.4% for six isolates and 95.1% for one isolate.
To obtain a reliable identification scheme for the rpoB gene in our isolates, it was important to determine the specific intra- and interspecies variation intervals, as has been done for the mip sequence-based classification scheme created by Ratcliff et al. (26). Therefore, our attention was focused mainly on the 14 isolates previously described as having higher discrepancies in nucleotide identity percentage. A pairwise identity matrix for the entire length of the rpoB gene based on 53 reference strains downloaded from NCBI, with a gene size from 4,101 to 4,143 bp, was built (Fig. 1a and b). The matrix returned an interspecies pairwise identity interval of 72.7% to 95.0%. Therefore, the obtained interspecies variation interval was between 5.0% and 27.3%. The calculated intraspecies identity interval was 95.1% to 100%, resulting in an intraspecies variation interval between 0% and 4.9%, which permits the classification of unknown isolates as belonging to the same species.
A second matrix was built considering only a 329-bp region of the rpoB gene (Fig. 2a and b) that was suggested by Ko et al. (31). The matrix returned an interspecies pairwise identity interval of 73.3% to 95.1%. The interspecies variation interval was between 4.9% and 26.7%. The intraspecies identity interval determined was 95.2% to 100%, resulting in an intraspecies variation interval between 0% and 4.8%. As previously described, these values permit the identification of isolates as belonging to the same species.
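The relationship between the identity and variation intervals reported above can be illustrated with a minimal Python sketch. It is not the Geneious workflow used in the study: the function names (percent_identity, identity_intervals, to_variation) and the toy aligned sequences are hypothetical, and the sequences are assumed to be already aligned to equal length. Identity is counted over columns where neither sequence has a gap, and variation is taken as 100 minus identity.

```python
from itertools import combinations


def percent_identity(a, b):
    """Percent identity over aligned columns where neither sequence has a gap."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(x, y) for x, y in zip(a.upper(), b.upper()) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no ungapped columns to compare")
    matches = sum(x == y for x, y in pairs)
    return 100.0 * matches / len(pairs)


def identity_intervals(aligned):
    """aligned maps strain name -> (species, aligned sequence).

    Returns (intraspecies, interspecies) identity spans as (min, max),
    or None for a class that has no pairs.
    """
    intra, inter = [], []
    for (_, (sp1, s1)), (_, (sp2, s2)) in combinations(aligned.items(), 2):
        (intra if sp1 == sp2 else inter).append(percent_identity(s1, s2))

    def span(values):
        return (min(values), max(values)) if values else None

    return span(intra), span(inter)


def to_variation(identity_span):
    """Convert an identity span (min, max) into a variation span (min, max)."""
    low, high = identity_span
    return (round(100.0 - high, 1), round(100.0 - low, 1))


# Hypothetical toy alignment, for illustration only (not study data).
toy = {
    "A1": ("species_A", "ACGTACGTAC"),
    "A2": ("species_A", "ACGTACGTAT"),
    "B1": ("species_B", "ACGAACCTGC"),
}
toy_intra, toy_inter = identity_intervals(toy)

# Identity intervals reported in the text for the entire rpoB gene.
full_gene_inter_variation = to_variation((72.7, 95.0))
full_gene_intra_variation = to_variation((95.1, 100.0))
```

Applied to the reported full-gene identity intervals, to_variation reproduces the 5.0% to 27.3% interspecies and 0% to 4.9% intraspecies variation intervals given above.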
On the basis of the intra- and interspecies intervals calculated from the 329-bp rpoB gene region identity matrix, we analyzed the results for 14 isolates that showed discrepancies in mip and rpoB gene identification. The two L. anisa isolates determined according to the gold standard mip gene classification scheme were correctly identified; in contrast, the percentage of identity found for rpoB with respect to the reference strains (92.4%) did not fall within the intraspecies identity interval (lower cutoff at 95.1%), thus suggesting the possibility that the strains belong to different species. The same considerations can be applied to the four L. quateirensis isolates, which showed a percentage of identity for the rpoB gene of 94.5%.
The identity values of seven strains of L. feeleii and one isolate of L. quinlivanii, determined according to the mip gene classification scheme, showed borderline results with the rpoB classification scheme based on the observed cutoff values of 95.1% to 95.4% for the presumptive L. feeleii and 95.7% for L. quinlivanii. These findings provide further evidence of their misidentification and the necessity of further investigation.
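The way the 14 discordant isolates were read against the intraspecies cutoff can be sketched as a simple decision rule. The sketch below is illustrative only: the 95.1% cutoff and the identity values are those quoted in the text, whereas the function name assess_rpob_identity and the 1.0-point "borderline" margin are assumptions introduced here and are not part of the study.

```python
INTRASPECIES_CUTOFF = 95.1  # lower identity cutoff (%) quoted in the text
BORDERLINE_MARGIN = 1.0     # hypothetical margin (percentage points), not from the study


def assess_rpob_identity(identity, cutoff=INTRASPECIES_CUTOFF, margin=BORDERLINE_MARGIN):
    """Label an rpoB percent identity relative to the intraspecies cutoff."""
    if identity < cutoff:
        return "outside intraspecies interval"
    if identity < cutoff + margin:
        return "borderline"
    return "within intraspecies interval"


# rpoB identities (%) versus the reference strain, as reported for the discordant isolates.
reported = {
    "L. anisa (2 isolates)": 92.4,
    "L. quateirensis (4 isolates)": 94.5,
    "L. feeleii (1 isolate)": 95.1,
    "L. feeleii (6 isolates)": 95.4,
    "L. quinlivanii (1 isolate)": 95.7,
}

labels = {name: assess_rpob_identity(value) for name, value in reported.items()}
```

Under these assumptions, the L. anisa and L. quateirensis isolates fall outside the intraspecies interval, while the L. feeleii and L. quinlivanii isolates are labeled borderline; a different margin would shift only the borderline class.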
Moreover, Table 2 reports the nucleotide and amino acid differences in the wild strains with respect to the corresponding reference strains. Interestingly, all the wild strains presented nucleotide differences in both genes. Despite the rpoB gene being characterized as having greater genetic variability (number of DNA mismatches), the deduced amino acid sequences of the mip gene showed a higher number of amino acid substitutions. It is important to emphasize that all 14 isolates focused on in our study showed few amino acid substitutions in the mip gene, from 1 to 3; in contrast, regarding the rpoB gene, only five amino acid substitutions were reported in L. taurinensis.
Figures 3 and 4 display the relationships between all 135 isolates used in the study and the corresponding reference strains for the mip and rpoB genes, respectively. The dendrograms built using the mip and rpoB gene sequences grouped all isolates into 10 clades, each corresponding to a specific Legionella species. In the mip gene dendrogram, no relevant differences were found, with the exception of two isolates of L. anisa (MR 54 and MR 97) that were separated from the corresponding main branch, suggesting a possible misidentification of these isolates. In contrast, the dendrogram built using the rpoB gene showed the same 10 clades but with a higher genetic distance between wild types and the reference strains.

DISCUSSION
Several studies have compared molecular methods to detect Legionella spp. in environmental and clinical samples, and it is well known that the amplification and sequencing of some genes for the direct detection and identification of bacteria can be simple, convenient, and specific in their differentiation of bacterial species. The use of PCR methods in Legionella identification and typing, thanks to their species-specific capability, has increased the power to detect and identify species, reducing the time and cost compared to culture and antibody approaches as well as improving the sensitivity and specificity of identification, especially for clinical approaches.
Currently, non-Lp species have been mostly identified by only the mip gene, although several studies have shown that no single system is perfect and that other target genes need to be investigated (27). The use of a particular region of the rpoB gene was already tested to determine phylogenetic relationships as well as the identification scheme for enteric bacteria, Mycobacterium, Bartonella, and other microorganisms (30,35,36). Ko et al. have already shown that a partial region of rpoB is able to discriminate subspecies of Lp and several non-Lp species that have not been differentiated using the mip sequence classification scheme (37). Many of the studies regarding the amplification of rpoB for Legionella spp. identification are exclusively focused on Lp, limiting the knowledge about the presence, distribution, and evolution of non-Lp species in the environment (37-39). This study showed the steps needed to build a new classification scheme using the rpoB gene and its application to a great number of non-Lp isolates (n = 135) distributed in both nosocomial and community environments. The results obtained were compared with the gold standard mip gene classification scheme already developed by Ratcliff et al. (26) and still in use by the European Society of Clinical Microbiology and Infectious Diseases (ESCMID) Study Group for Legionella Infections (ESGLI).
Our results confirmed, in agreement with previous studies, that both mip and rpoB are able to discriminate among Legionella species, considering that most of our isolates (121/135 [89.6%]) showed complete concordance between the two classification schemes. It is important to note that in some cases, there was no concordance between the mip and rpoB results, as there was a low percentage of genomic identity with respect to the reference strains for 14 isolates. In detail, our results suggest that sequencing using only rpoB is able to detect relevant genetic differences between the wild-type and the reference strains, which would otherwise be undetected using only the mip approach (e.g., L. feeleii, L. anisa, L. quinlivanii, and L. quateirensis). This result is especially interesting given that L. quateirensis and L. anisa showed a variability percentage for the rpoB gene outside the intraspecies interval of variation found here (0 to 4.8%); L. feeleii and L. quinlivanii had values very close to the variation cutoff, suggesting that the identification scheme using only one gene limits the discovery and study of species variation and sometimes limits discrimination between different species. In line with previous results, all the dendrogram representations show that there is lower genetic diversity in the mip gene between and within the clades; in contrast, the diversity in rpoB appears to be greater, leading to the identification of several isolates that showed evident differences from their respective clade or reference strain (e.g., L. anisa, L. feeleii, etc.). The results obtained using the rpoB gene seem to be useful for the identification of non-Lp species, and the results obtained permit the construction of the first rpoB gene classification scheme in the scientific literature.
Thanks to the matrices described above, we built pairwise identity intervals that allowed us to classify our unknown sequences based on comparisons with reference strains. The comparison carried out using the values obtained here seems reliable, and we propose that they be used in a classification scheme. For strains whose similarity percentages are very close to the cutoff values, further in-depth analyses are recommended. Based on the intervals of variation derived from the pairwise identity matrices, the discriminatory power of the 329-bp target region for the non-Lp species appears to be as reliable as that of the entire gene.
The comparison between the two matrices shows that the variability in the entire gene is greater than that in the selected region, suggesting that the analysis of a larger portion of the genome could increase the discriminatory power; therefore, approaches using new sequencing strategies, such as whole-genome sequencing (WGS), could contribute to better clarifying the identification of our isolates. This approach has already been applied to the four isolates of L. quateirensis described here. The average nucleotide identity (ANI) analysis, performed comparing their entire genome and the L. quateirensis type strain, showed pairwise values below the similarity threshold fixed to 95%, validating the hypothesis that the four strains belong to a presumptive new Legionella species (40).
In terms of the number of DNA and amino acid mismatches, most variability in the number of amino acid substitutions was observed in the mip gene, as all reported isolates showed discrepancies regarding the identification scheme based on the mip and rpoB genes. The role of the mip gene is widely documented; it is involved in the ability of L. pneumophila to replicate in eukaryotic cells and environmental amoebae (41). The substitutions found could interfere with pathways influenced by mip, as documented for Lp as well as for some non-Lp species (42-44).
The rpoB gene displayed a high number of DNA mismatches with a low number of amino acid variations. This result could be explained by the fact that rpoB is a housekeeping gene and that the alteration in the amino acid sequence could interfere with rifampicin resistance, as already demonstrated in other bacteria (e.g., Mycobacterium tuberculosis) and in a few Legionella species (39,45). Therefore, the five amino acid mismatches found in L. taurinensis indicate the need to study the role of these variations in terms of protein function. Further investigations involving in silico protein modeling and structural prediction, in addition to biochemical functionality studies, might help to clarify the role of these amino acid alterations and their evolution in Legionella species.
Although the non-Lp classification scheme using single-gene identification, such as the mip gene, is widely used and approved, the identification scheme for Legionella requires an update, such as introducing several patterns from various genes so as to increase the power of identification and improve phylogenetic studies. Especially for routine clinical and environmental laboratories where the whole-genome approach is expensive and laborious, the introduction of an easy, less expensive, and more sensitive scheme of identification could avoid errors in species characterization. Moreover, the proposed identification scheme could represent the first step toward acquiring information on different characteristics of isolates, such as changes in and development of antibiotic or disinfectant resistance, avoiding the failure of routine tests (e.g., urinary antigen test, serological and antibody-based assays), inadequate antibiotic treatments in the context of human infection (e.g., rifampicin, fluoroquinolone, macrolides), and disinfection treatment. If a discrepancy is observed in this first step, then a more advanced technology, such as WGS, can be applied. This combined strategy represents an improved screening approach for Legionella isolate identification.

MATERIALS AND METHODS
The isolates involved in this study come from Legionella environmental surveillance programs of several facilities commonly associated with the risk of Legionella infections, including hospitals, companies, and communities (e.g., hotels, private apartments).
Legionella culture and isolate selection. The Legionella culture technique was based on ISO 11731:2017 (20). The hot- and cold-water samples were sampled following the Italian National Unification and European Committee (UNI EN) ISO 19458:2006 (46) and Italian guidelines (19).
Different aliquots (from 0.2 to 0.1 mL) of the untreated, filtered, heated, and acid-treated samples were seeded on plates of the selective medium glycine-vancomycin-polymyxin B-cycloheximide (GVPC) (Thermo Fisher Scientific, Diagnostic, Ltd., Basingstoke, UK) and incubated at 35 ± 2°C with 2.5% CO2 for a maximum of 15 days. Legionella growth was evaluated every 2 or 3 days.
To confirm the presence of the Legionella genus, suspected colonies were subcultured on buffered charcoal yeast extract (BCYE) agar with (cys+) and without (cys−) L-cysteine (L-cys) supplementation (Thermo Fisher Scientific, Diagnostic, Ltd., Basingstoke, UK). Moreover, as a negative control, the same isolates were spread on tryptone soy agar (TSA) with 5% sheep blood (Thermo Fisher Scientific, Diagnostic, Ltd., Basingstoke, UK) and incubated under the same conditions previously described, as Legionella is not able to grow on this medium. Only the colonies that grew on BCYE cys+ agar were considered for the next steps of the study.
Serological and biochemical typing. The predicted Legionella colonies were then identified using the Legionella latex agglutination test (Legionella latex test kit, Thermo Fisher Scientific, Diagnostic, Ltd., Basingstoke, UK), which is able to distinguish between Lp and non-Lp. In particular, among Lp, it is possible to identify serogroup 1 (Sg1) from Sg2 to Sg14, while among non-Lp, it is possible to recognize only some non-Lp, such as L. anisa, L. bozemanii 1 and 2, Legionella gormanii, L. longbeachae 1 and 2, L. dumoffii, and L. jordanis. A total of 134 strains of non-Lp and 1 strain of Lp that was previously typed by sequence-based typing (SBT) and included as a positive control were selected for the study.
Identification of Legionella spp. by mip and rpoB gene sequencing. The DNA of each strain was extracted using the InstaGene purification matrix (Bio-Rad, Hercules, CA), and DNA concentrations were determined using a Qubit fluorometer (Thermo Fisher Scientific, Paisley, UK). PCR analysis for all non-Lp isolates was performed to determine the gene sequences of mip and rpoB as described by Ratcliff et al. (26) and Ko et al. (31), respectively. mip gene amplification was performed using degenerate primers and modified by M13 tailing to avoid noise in the DNA sequence (47). mip gene amplification was performed in a 50-µL reaction mixture containing 2× DreamTaq Green PCR master mix (Thermo Fisher Diagnostics, Basingstoke, UK) and 40 pmol of each primer; 100 ng of the DNA extracted from the presumptive colonies was added as the template. The mip amplicons were sequenced using tailed M13 forward and reverse primers (mip-595R-M13R caggaaacagctatgaccCATATGCAAGACCTGAGGGAAC and mip-74F-M13F tgtaaaacgacggccagtGCTGCAACCGATGCCAC) to obtain complete coverage of the region of interest (47). Amplification was performed in a thermocycler under the following conditions: predenaturation for 3 min at 96°C, then 35 cycles consisting of 1 min at 94°C for denaturation, 2 min at 58°C for annealing, and 2 min at 72°C for extension, followed by a final extension at 72°C for 5 min. The reaction mixtures were then held at 4°C.
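The structure of the M13-tailed sequencing primers can be made explicit with a short Python sketch. It assumes, as the notation above suggests, that the lowercase prefix is the M13 tail and the uppercase remainder is the mip-specific portion; the helper name split_tailed_primer is hypothetical.

```python
def split_tailed_primer(primer):
    """Split a tailed primer into (tail, gene_specific).

    Assumes the tail is written as a lowercase prefix and the
    gene-specific portion as the uppercase remainder.
    """
    i = 0
    while i < len(primer) and primer[i].islower():
        i += 1
    return primer[:i], primer[i:]


# Primer sequences as given in the text (5' to 3').
tailed_primers = {
    "mip-74F-M13F": "tgtaaaacgacggccagtGCTGCAACCGATGCCAC",
    "mip-595R-M13R": "caggaaacagctatgaccCATATGCAAGACCTGAGGGAAC",
}

primer_parts = {name: split_tailed_primer(seq) for name, seq in tailed_primers.items()}
```

Separating the two parts in this way makes it easy to trim the tail from a read or to check the length of the gene-specific portion before ordering primers.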
rpoB gene amplification was performed as described by Ko et al. (31). Gene amplification was performed in a 50-µL reaction volume containing 100 ng of template DNA, 40 pmol of each primer (RL1 5′-GATGATATCGATCAYCTDGG-3′; RL2 5′-TTCVGGCGTTTCAATNGGAC-3′), 1 U of Taq polymerase, and a PCR mixture consisting of 10× PCR buffer, 1.5 mM MgCl2, and 250 µM deoxynucleoside triphosphates (dNTPs). The thermal cycles consisted of 35 cycles, and each cycle consisted of 30 s at 94°C for denaturation, 30 s at 55°C for annealing, and 30 s at 72°C for extension, followed by a final extension at 72°C for 10 min. PCR products were visualized by electrophoresis on a 2% agarose gel and stained with ethidium bromide. Following purification, DNA was sequenced using BigDye chemistry and analyzed on an ABI PRISM 3100 genetic analyzer (Applied Biosystems, Foster City, CA). Raw sequencing data were assembled using CLC Main Workbench 7.6.4 software.
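The RL1 and RL2 primers contain IUPAC degenerate bases (Y, D, V, and N), so each primer name stands for a small pool of oligonucleotides. The following Python sketch enumerates those pools using the standard IUPAC nucleotide codes; the function name expand_degenerate is an illustrative choice and is not taken from the cited protocol.

```python
from itertools import product

# Standard IUPAC nucleotide ambiguity codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def expand_degenerate(primer):
    """Return every unambiguous sequence encoded by a degenerate primer."""
    choices = [IUPAC[base] for base in primer.upper()]
    return ["".join(combo) for combo in product(*choices)]


# Primer sequences as given in the text (5' to 3').
RL1 = "GATGATATCGATCAYCTDGG"
RL2 = "TTCVGGCGTTTCAATNGGAC"

rl1_variants = expand_degenerate(RL1)  # Y (2 bases) x D (3 bases)
rl2_variants = expand_degenerate(RL2)  # V (3 bases) x N (4 bases)
```

RL1 therefore corresponds to 6 distinct sequences and RL2 to 12, which is worth keeping in mind when judging the effective concentration of any single primer variant in the 40 pmol added per reaction.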
The mip sequences were compared to sequences deposited in the Legionella mip gene sequence database using a similarity analysis tool. EWGLI has established an accessible web database (http://bioinformatics.phe.org.uk/cgi-bin/Legionella/mip/mip_id.cgi) that contains sequence data from described species and allows for the identification of non-Lp species. Species-level identification was performed on the basis of a similarity score of 98 to 100% compared to the sequences in the database (27) and considering the intra- and interspecies intervals of variation previously described by Ratcliff et al. (26).
The rpoB sequences were compared to type strain sequences deposited in NCBI from several culture collections, including the American Type Culture Collection (ATCC), National Collection of Type Cultures, Central Public Health Laboratory (NCTC), NITE Biological Research Center, National Institute of Technology and Evaluation (NBRC), and Deutsche Sammlung von Mikroorganismen und Zellkulturen (DSM). According to Adékambi et al. and Ko et al., the cutoff used for rpoB gene sequence-based identification was fixed at a 94 to 95% similarity percentage using an rpoB gene fragment of 300 to 600 bp (31,48).
Elaboration of matrices for the rpoB gene: definition of the intra- and interspecies intervals of variation. Legionella type strains (n = 53) retrieved from the NCBI were used to determine the ranges of the intra- and interspecies intervals of variation for the rpoB gene, resulting in a pairwise identity matrix for the entire gene with a length from 4,101 to 4,143 bp (Fig. 1a and b) and for the 329-bp selected region (Fig. 2a and b), corresponding to the amplicon suggested by Ko et al. (31). The list of type strains used in the study is reported in Table 3.
The matrices were built using the multiple sequence comparison by log-expectation (MUSCLE) program (49) in Geneious Prime 2021.1.1 (https://www.geneious.com), retaining the default settings.
Phylogenetic analysis was performed on the mip and rpoB gene sequences. For each taxon identified as previously described, the reference mip and rpoB gene sequences of the corresponding type strains from several culture collections were retrieved and added to the analysis (Table 3). When required, manual editing was performed on the sequences, trimming them to the same length as the reference sequence. The nucleotide sequences were aligned by the MUSCLE program. The obtained MSA was passed to FastTree (50), a tool for inferring approximate maximum likelihood phylogenetic trees. FastTree uses Jukes-Cantor as a genetic distance model and the Shimodaira-Hasegawa test to estimate the reliability of each split in the tree (51). Branch lengths were transformed to be equal, as in a cladogram. Branch labels display the substitutions per site. Both MUSCLE and FastTree were performed in Geneious Prime 2021.1.1 (https://www.geneious.com), retaining the default settings.
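For readers unfamiliar with the Jukes-Cantor model mentioned above, the textbook distance correction converts an observed proportion of differing sites, p, into an estimated number of substitutions per site, d = −(3/4) ln(1 − 4p/3). The Python sketch below implements only this standard formula; it is not FastTree's implementation, and the function name jukes_cantor_distance is chosen here for illustration.

```python
import math


def jukes_cantor_distance(p):
    """Jukes-Cantor (JC69) distance for an observed proportion p of differing sites.

    Returns the estimated number of substitutions per site.
    The correction is undefined for p >= 0.75.
    """
    if not 0.0 <= p < 0.75:
        raise ValueError("p must satisfy 0 <= p < 0.75")
    return -0.75 * math.log(1.0 - (4.0 / 3.0) * p)


# Example: an observed difference of 4.9% (the intraspecies variation limit
# reported for the entire rpoB gene) expressed as a proportion.
d_at_cutoff = jukes_cantor_distance(0.049)
```

The corrected distance is always at least as large as the observed proportion, because the model accounts for multiple substitutions at the same site; the gap widens as sequences become more divergent.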
Data availability. The GenBank accession numbers of sequences generated during this study are listed in Table 4.