Comparative Genomic Analysis Reveals Genetic Diversity and Pathogenic Potential of Haemophilus seminalis and Emended Description of Haemophilus seminalis

ABSTRACT Haemophilus seminalis is a newly proposed species that is phylogenetically related to Haemophilus haemolyticus. The distribution of H. seminalis in the human population, its genomic diversity, and its pathogenic potential are still unclear. This study reports the finding of our comparative genomic analyses of four newly isolated Haemophilus strains (SZY H8, SZY H35, SZY H36, and SZY H68) from human sputum specimens (Guangzhou, China) along with the publicly available genomes of other phylogenetically related Haemophilus species. Based on pairwise comparisons of the 16S rRNA gene sequences, the four isolates showed <98.65% sequence identity to the type strains of all known Haemophilus species but were identified as belonging to H. seminalis, based on comparable phenotypic and genotypic features. Additionally, the four isolates showed high genome-genome relatedness indices (>95% ANI values) with 17 strains that were previously identified as either “Haemophilus intermedius” or hemin (X-factor)-independent H. haemolyticus and therefore required a more detailed classification study. Phylogenetically, these isolates, along with the two previously described H. seminalis isolates (a total of 23 isolates), shared a highly homologous lineage that is distinct from the clades of the main H. haemolyticus and Haemophilus influenzae strains. These isolates present an open pangenome with multiple virulence genes. Notably, all 23 isolates have a functional heme biosynthesis pathway that is similar to that of Haemophilus parainfluenzae. The phenotype of hemin (X-factor) independence and the analysis of the ispD, pepG, and moeA genes can be used to distinguish these isolates from H. haemolyticus and H. influenzae. Based on the above findings, we propose a reclassification for all “H. intermedius” and two H. haemolyticus isolates belonging to H. seminalis with an emended description of H. seminalis. This study provides a more accurate identification of Haemophilus isolates for use in the clinical laboratory and a better understanding of the clinical significance and genetic diversity in human environments. IMPORTANCE As a versatile opportunistic pathogen, the accurate identification of Haemophilus species is a challenge in clinical practice. In this study, we characterized the phenotypic and genotypic features of four H. seminalis strains that were isolated from human sputum specimens and propose the “H. intermedius” and hemin (X-factor)-independent H. haemolyticus isolates as belonging to H. seminalis. The prediction of virulence-related genes indicates that H. seminalis isolates carry several virulence genes that are likely to play an important role in its pathogenicity. In addition, we depict that the genes ispD, pepG, and moeA can be used as biomarkers for distinguishing H. seminalis from H. haemolyticus and H. influenzae. Our findings provide some insights into the identification, epidemiology, genetic diversity, pathogenic potential, and antimicrobial resistance of the newly proposed H. seminalis.

shared a highly homologous lineage, which is distinct from the main H. haemolyticus and H. influenzae clades (Fig. 2). These 23 isolates demonstrated DNA-DNA relatedness values of 95.24 to 100% ANI and 63.00 to 100% digital DNA-DNA hybridization (dDDH) to each other (Table S3). They had the next highest genome relatedness indices to the isolates belonging to the main H. haemolyticus clade, 92.16 to 93.83% ANI values (Fig. S2). The genome size of the 23 isolates ranges from 1.78 Mb to 2.06 Mb, with GC contents of 38.1 to 38.4 mol%. The complete genome of strain SZY H68 was circular with a genome size of 1,902,920 bp. In total, 1,828 predicted genes were annotated, including 1,731 protein-coding genes, 81 encoded RNAs (58 tRNA genes, 4 ncRNA genes, 7 copies of 5S rRNA genes, 6 copies of 16S rRNA genes, and 6 copies of 23S rRNA genes), and 16 pseudogenes. No plasmids were detected in the genome of strain SZY H68. , the four isolates obtained in this study (SZY H8, SZY H35, SZY H36, SZY H68) were considered to be a putative novel species of the genus Haemophilus. For further comparative studies, isolate SZY H68 was selected as the representative strain. The six copies of 16S rRNA genes extracted from the genome (accession number: CP091469) also showed identical sequences with considerable divergence (,98.65% similarities) to the type strains of valid-published species. Further, the two H. seminalis strains, SZY H1 T and SZY H2, and the H. haemolyticus strain CCUG 15949 formed a single subclade, with 99.78 to 100% sequence similarity within themselves (Table S4).
The phylogenetic analysis for these Haemophilus isolates was based on the 23S rRNA gene and the previously described single-copy housekeeping genes such as rpoB, recA, frdB,  mdh, adk, atpG, and pgi (16). The basic topology of the phylogenetic tree created on the basis of these genes also showed that all the H. seminalis, H. haemolyticus, and "H. quentini" isolates were clustered together with a big clade, which was quite similar to that of the 16S rRNA gene. Therefore, the single gene analyses of 16S rRNA, 23S rRNA, and the above housekeeping genes were unable to distinguish H. seminalis from H. haemolyticus and "H. quentini" (Fig. S3). A further search for effective marker genes was also performed based on pan/core genome analysis. The genes ispD, pepG, and moeA retrieved from these 173 Haemophilus spp. genomes were found to be conserved and were selected for comparative analysis. Comparison of ispD gene sequences showed that 23 isolates clustered together and had a 97.35 to 100% sequence similarity, which is distinct from H. haemolyticus and H. influenzae clades (Fig. 3) (Table S5). We also performed a phylogenetic analysis for these Haemophilus isolates based on the pepG and moeA genes. The dendrograms revealed that the phylogenetic positions of the isolates were similar to that of the ispD gene (Fig. S4). However, the resolving power of the ispD gene was found to be relatively better than those of the pepG and moeA genes. Pangenome analyses. The pangenome of all 23 H. seminalis strains comprised 3,934 genes, including 1,408 core genes (present in 99 to 100% of genomes), 53 softcore genes (present in 95 to 99% of genomes), 676 shell genes (present in 15 to 95% of genomes), and 1,797 cloud genes (present in 0 to 15% of genomes) (Fig. 4A). For the core genes, 84.7% (1,193/1,408) encoded a known function, while only 15.3% (215/1,408) were assigned to hypothetical proteins. The accessory genome comprised 676 shell (17.2%) and 1,797 cloud (45.7%) genes. Remarkably, 76.5% (1,892/2,473) of the accessory genes were considered to code for hypothetical proteins, and 54.3% of the predicted genes (2,137/3,934) have no known function (Data Set S1).
The clusters of orthologous groups (COG) analysis showed that the core genome proteins were enriched in translation, ribosomal structure and biogenesis (J), energy production and conversion (C), amino acid transport and metabolism (E), and inorganic ion transport and metabolism (P) genes. This indicates that these genes were more critical to the survival of H. seminalis and were hence more conserved. The proteins of the transcription (K), replication, recombination, and repair (L), and the cell wall, membrane, and envelope biogenesis (M) formed the majority of the cloud genome. Significantly, the pangenome had a high proportion of cloud genes with hypothetical proteins. However, these unknown functional genes with limited distribution contributed to the genetic diversity of H. seminalis, and their biological roles require further research ( Fig. S5; Data Set S1).  The pangenome and core genome of the analyzed genomes of H. seminalis are illustrated in Fig. 4B. As can be seen, the size of the core genome converged, and the pangenome size gradually expanded without reaching a plateau with the addition of new genomes. Furthermore, the fitting parameter B of the curve is greater than 0 and less than 1, which proved that the pangenome of H. seminalis is open. An open state of bacterial pangenome reflects the diversity within the gene pool of a given species.

Haemophilus influenzae 11P6H
Potential virulence-associated gene prediction analyses. The virulence genes were predicted using a Virulence Factor Database (VFDB) analysis (17). The putative virulence genes of these 23 H. seminalis strains are presented in Fig. 5, and these genes were mainly associated with antiphagocytosis, adherence, immune evasion, endotoxin, and iron uptake (Data Set S1). Besides, the capsule-encoding genes bexD, hcsA, and ccs1, responsible for the synthesis of the polysaccharide capsule, were detected only in strain 60824 B Hi-4 (  of respiratory tract source. These genes were not detected in any of the six isolates from Guangzhou, nor were their capsular structures observed via transmission electron microscopy (Fig. S1).
Adherence to the host tissue is essential for bacterial colonization and subsequent infection (18). We observed that the hemagglutinating pili locus (hifABCDE genes), which is associated with biofilm formation (19), was not found in any of the 23 H. seminalis genomes. The hap gene, which encodes the adhesion and penetration Hap protein, was found in 4 of 23 (17.4%) H. seminalis genomes. The genes hmw1c, oapA, ompP2, ompP5, comE/pilQ, and pilABCD, which contribute to adhesion to host tissue, were also found in all 23 H. seminalis genomes. These genes encode HMW, OapA, OMP P2, OMP P5, and type IV pili, respectively. Notably, the multipurpose proteins of HMW, OMP P2, and OMP P5, with roles in adhesion and biofilm, also play an important role in internalization and immune evasion (20).
Immune exclusion is the most important defense mechanism involved in protection of mucosal membranes. The gene iga1 encoding immunoglobulin A1 (IgA1) proteases, which is reported to be significantly correlated with the chronic obstructive pulmonary disease of NTHi isolates (21,22), was found in all 23 H. seminalis genomes.
Despite limited research, exopolysaccharide (EPS) is also reported to be a component of the biofilm matrix for NTHi isolates (25). These genes (manA, mrsA/glmM, and pgi genes), which function in EPS synthesis, were found in most of the 23 H. seminalis genomes.
To promote persistence during infection, Haemophilus possesses the mechanisms to acquire iron and heme complex (26). The hitABC operon encoding a periplasmic-binding protein-dependent iron transport system was present in all H. seminalis, which is necessary for utilizing iron bound to transferrin or iron chelates (26). Notably, all H. seminalis strains have heme biosynthesis genes: hemA, hemB, hemC, hemD, hemE, hemG, and hemN. The derivatives of iron uptake genes, such as the hgpABC gene encoding a hemoglobin-binding protein, the TbpAB genes encoding a transferrin-binding protein, and the HxuABC genes encoding hemoglobin-binding complex, which are vital for acquiring iron and heme for growth, were not found in most of the 23 H. seminalis genomes.

DISCUSSION
The genus Haemophilus is a heterogenous group and has been reported as a normal inhabitant of the upper respiratory tract of humans, the oral cavity, and mucous membranes (1)(2)(3)(4)(5)(6)(7)(8)(9)(10)(11). The new entry among the Haemophilus isolates is the description of two isolates with the name H. seminalis to indicate its isolation from the clinical semen samples of two patients who were suspected to be infected with Neisseria gonorrhoeae (9). This study further adds to the repertoires of Haemophilus isolates. This study provides a systemic characterization of four strains (SZY H8, SZY H35, SZY H36, and SZY H68) isolated from human sputum belonging to H. seminalis. Despite the low 16S rRNA gene sequence identity (,98.65%) to H. seminalis SZY H1 T , they were classified to H. seminalis based on the comparative studies of phenotypic (Tables 2 and 3) and genotypic features, (e.g., ANI [.95%] values). To gain more insight into the H. seminalis strains, a comparative genomic analysis was conducted for 173 Haemophilus genomes retrieved from the NCBI GenBank database. These genomes include isolates previously identified as H. seminalis, H. haemolyticus, "H. intermedius", "H. quentini", and H. influenzae. 17 strains with .95% ANI values to SZY H1 T were finally identified, of which 14 were previously described as "H. intermedius" or hiHh (11).
Historically, "H. intermedius" is informally named due to the intermediate position between H. influenzae and H. parainfluenzae (14). Although the previous study reclassified "H. intermedius" as hiHh for their high genetic similarity to H. haemolyticus (11), we feel that the "H. intermedius" or hiHh should be classified as H. seminalis. First, the name "H. intermedius", was proposed at more or less the same time as H. seminalis and was not validated. Therefore, it was not recognized by ICSP, whereas the name H. seminalis is described as per the existing rules and regulations of ICNP. Second, the phylogenomic analyses demonstrated that the 14 "H. intermedius" or hiHh isolates shared a highly homologous lineage with H. seminalis isolates, and this lineage is distinctive from the clades of the main H. haemolyticus group and that of H. influenzae isolates (Fig. 2). Third, the 14 "H. intermedius" or hiHh isolates shared the same heme biosynthesis genes in their core genomes, similar to those of the H. seminalis isolates, but different from those of H. haemolyticus and H. influenzae. Therefore, we propose the reclassification of the isolates identified as "H. intermedius", or hiHh as belonging to H. seminalis. Accordingly, we propose an emendation of the description of the species of H. seminalis.
In recent years, a clear increase in urethral infections by Haemophilus spp. was observed, and the main route of transmission was reported to be unprotected oral sex and intercourse in men who have sex with men (7,15,(27)(28)(29)(30). Several Haemophilus species, for example, H. parainfluenzae, H. influenzae, "H. quentini", H. pittmaniae, and H. sputorum have been reported to be the etiological agent of nongonococcal urethritis, of which the most common one was identified as Haemophilus parainfluenzae (7,27,29,30). As the first isolation of H. seminalis was from the clinical semen samples, we hypothesized that these isolates are other pathogens responsible for causing reproductive tract infection. Unlike the earlier cases, the four isolates obtained in this study were from sputum samples of four immunocompromised patients with suspected tuberculosis, contact dermatitis, malignant tumor of the stomach, and nephrotic syndrome. Considering the relatedness of these four isolates to SZY H1 T from semen, as well as other isolates from bronchoalveolar lavage fluid, urine, eye, cerebrospinal fluid, feces, ascitic fluid, and pleural fluid, and its ability to grow like the other normal microbiota of upper respiratory tract, we define H. seminalis as a part of the human commensal flora that can cause invasive infections.
The comparative genomic analysis also reveals the genetic diversity and pathogenic potential of H. seminalis isolates. First, H. seminalis has an open pangenome. Microbes with an open pangenome and having large and diverse accessory genomes indicate that species can migrate to new niches and adapt to changing environmental conditions (31,32). Second, H. seminalis was found to possess most of the virulence genes of H. influenzae, which implied a pathogenic potential similar to H. influenzae at a genomic level. For example, hmw1c, oapA, ompP2, ompP5, comE/pilQ, and pilABCD genes, which contributed to adhesion to host tissue for most NTHi isolates (33)(34)(35)(36)(37), were found in all the 23 H. seminalis genomes. The iga1 gene that encoded specific IgAl proteases that cleave human IgA1, which is a significant protective gene of NTHi isolates (21,22), was also found in all 23 H. seminalis genomes. The same goes for the series of genes that function in LOS and EPS synthesis; they are reported in most H. influenzae isolates (18,21,38) and were also found on almost all 23 H. seminalis genomes. However, the cellular capsule encoding locus of the bexABCD, hcsA, and hcsB genes, which is a major virulence factor of encapsulated typeable H. influenzae (39), was not present in the majority of the 23 H. seminalis genomes. As the cellular capsule was not observed, we infer that H. seminalis may represent a lineage with low virulence that is similar to H. parainfluenzae, H. haemolyticus, and nontypeable H. influenzae.
H. influenzae is known to acquire genetic material from other Haemophilus spp. (particularly H. haemolyticus), other unrelated species, and even host cells (12,40,41). Similarly, H. seminalis also has the competence to acquire genetic material from other species. The type IV pilus genes (pilABCD), which are necessary for competence, were found in all 23 H. seminalis genomes. The previous study demonstrated that the hemin biosynthesis loci acquired in the H. seminalis lineage were likely laterally transferred from an H. parainfluenzae ancestor (11).
Although the majority of Haemophilus isolates are noninfectious commensals, precise identification of H. seminalis among so many Haemophilus species is important when it grows dominantly in a valid sputum specimen. However, it is hard to distinguish H. seminalis from H. haemolyticus and H. influenzae using single phenotypic and genotypic methods. We attempted to assess H. seminalis intraspecies differences based on the 16S rRNA gene, 23S rRNA gene, and the housekeeping genes of rpoB, recA, frdB, mdh, adk, atpG, and pgi. The results show that these genes did not contain the sufficient differential capacity to effectively identify H. seminalis from H. haemolyticus and H. influenzae. Intraspecies classification was therefore undertaken using a whole-genome-based approach. Fortunately, the ispD gene was found to be a reliable phylogenetic marker for these three related species based on intraspecies comparisons. The rapid and reliable MALDI-TOF MS method for H. seminalis is also expected to be explored for species identification in clinical microbiology laboratories.
In this study, a pangenome analysis revealed that some housekeeping genes, such as fabB, fabI, pal, ligA, folP, and rfaF, were missing in individual strains of H. seminalis. However, the absence of these genes may be related to the limitations of whole-genome sequencing technology. Since short-read sequencing technology was used, it is likely that it influenced the number of contigs and the quality of the whole-genome assembly, leading to the missing of certain protein-coding genes in the genome analyses.
In conclusion, our study provides insights into the identification, epidemiology, genetic diversity, pathogenic potentials, and antimicrobial resistance of H. seminalis. Notably, the genome numbers in this study are not large enough to define a total accessory genome of H. seminalis and the addition of more high-quality genomes will probably increase the size of the accessory genes. Overall, a further epidemiological investigation is also expected to explore H. seminalis infection in the future.

TAXONOMY
Emended description of Haemophilus seminalis by Zheng et al., 2020. The isolates previously described as "H. intermedius" or hiHh should be included with the species description, as given by Zheng et al. (9), with the following amendments.
The cells were pleomorphic rods or coccobacilli. All tested strains grew between 28 to 40°C, although some isolates can grow at 42°C. The V-factor was dependent, but the X-factor was independent. Positive for catalase activity. Negative for motility. Varied for oxidase activity, hemolytic activity, nitrite reduction, and indole production. The G1C content ranged from 38.1% to 38.4%.

MATERIALS AND METHODS
Strains and genomes. Strains SZY H1 T and SZY H2 were isolated from clinical human semen in our previous study (9). Four isolates, designated SZY H68, SZY H8, SZY H35, and SZY H36, were obtained from clinical sputum specimens of four hospitalized patients that were collected between September of 2018 and March of 2019 (Table 1). The genomic DNA of the four isolates was extracted using a bacterial genomic DNA extraction kit (AG, China), according to the manufacturer's protocols, and then the whole-genome sequencing was performed by the Novogene Company (Beijing, People's Republic of China).
Phenotypic tests. Gram-staining and acid-fast staining were carried out according to standard operating procedures. Morphological traits were observed by using transmission electron microscopy (JEM-1400, JEOL). H. influenzae ATCC 10211 was used as a capsulated control. Growth activity at 4,8,14,28,32,35,37,40, and 42°C was observed by culturing the isolates on Haemophilus chocolate 2 agar at 35°C and 5% CO 2 . The requirements for X-factors and/or V-factors were assayed by using commercial discs supplemented with X-factors, V-factors, and X1V-factors (Liofilchem) on BHI agar. The satellite phenomenon and hemolytic activity were tested by observing the growth of the isolates around Staphylococcus aureus ATCC 25923 on Columbia blood agar. Traditional biochemical analyses, including the catalase, oxidase, arginine dihydrolase, urease, tryptophan decarboxylase, gelatinase, lysine decarboxylase activities, Simmons' citrate, Voges-Proskauer, indole production, H 2 S production, hydrolysis of esculin, and reduction of nitrate, were assessed as described by Aslanzadeh (47). Polar lipids were extracted, separated, and analyzed via two-dimensional TLC, according to the method of Minnikin et al. (48).
Antimicrobial susceptibility testing was carried out on HTM agar (Detgerm, China) and a TDR NH-96 kit (Mindray, China). The susceptibility criterion was interpreted by referring to disc diffusion and MIC breakpoint using the MIC breakpoint of H. influenzae and H. parainfluenzae as the standard, as listed in the CLSI M100-S31 guidelines (Performance Standards for Antimicrobial Susceptibility Testing; Thirtyone Informational Supplement).
rRNA and protein-encoding gene analysis. The 16S rRNA genes of the four isolates were amplified and sequenced as previously described (9,60). Meanwhile, the targeted rRNA and protein-encoding genes were extracted from the genome using RNAmmer (version 1.2) (61) and in-house JavaScript, respectively. The sequence comparison with those related strains was performed on the NCBI BLAST website (https://blast.ncbi.nlm.nih.gov/Blast.cgi). A maximum likelihood tree based on the nucleotide sequences was constructed for each of the mentioned genes. Multiple alignments were performed using MUSCLE (62). Poorly aligned regions were removed from the data sets via BMGE (63). A maximum likelihood (ML) phylogenomic tree was generated by the IQTree2 software (64), which automatically calculates the best-fit model. Bootstrap values were determined for 1,000 replications. Tree visualization was performed using iTOL (65).
Pangenome analyses. Through gff files produced by Prokka, pangenomes were deduced by using the pangenome pipeline Roary (66). Two sets of pangenomes were created. The first set contained all H. seminalis genomes. Pangenome analysis was performed on these strains genomes by using Roary software with the default blastP identity cutoff of 85%. The second contained 23 H.seminalis, 50 H.haemolyticus, 2 "H.quentini", and 98 H. influenzae genomes. The core gene nucleotide sequences of these strains' alignments were performed with Roary using PRANK (with MAFFT), and a blastP identity cutoff of 60%. FastTree v2.1.9 (67) was used to calculate maximum likelihood phylogenetic trees, based on these core gene alignments. The final tree was visualized using iTOL (65).
Related analysis data of the pangenome and core genome were obtained by using the method proposed by Knight et al. (68) and the models given by Tettelin et al. (69,70). The curve fitting of the pangenome was performed using a power law regression based on Heaps' Law (y = Ax B 1 C), as previously described (69)(70)(71), where y is the pangenome size, x is the number of genomes, and A, B, and C are the fitting parameters. When 0 , B , 1, the size of the pangenome increases continuously with the sequential additions of new genomes, which indicates that the pangenome is open. In contrast, when B , 0 or B . 1, this shows that the pangenome is closed. Similar to the pangenome plot, the number of core genes after the addition of each new genome was plotted as a function of the number of genomes added in order. The exponential curve fit model (y = Ae Bx 1 C) was used to fit the core genome (69)(70)(71), where y denotes the core genome size, x is the number of genomes, and A, B, and C are the fitting parameters. Both the pangenome and the core genome were visualized via PanGP (72).
Virulence gene and antimicrobial resistance gene analysis. The prediction of virulence genes was performed through searches querying each genome against the Virulence Factor Database (17) with the VFanalyzer web server. The default parameters were used. The results were visualized by using Chiplot. The detection of the capsule genes was performed by BLAST against a custom database of capsule genes, which was created based on a previous study (39). The resistance genes were identified using ABRicate v1.0.1 (https://github .com/tseemann/abricate) with the CARD database (73) and the Antibiotic Resistance Genes Database (ARDB) (http://ardb.cbcb.umd.edu/). Data availability. The 16S rRNA of the SZY H8, SZY H35, SZY H36, and SZY H68 were deposited into GenBank with the accession numbers MZ191769 to MZ191772. The complete genome of SZY H68 and draft genomes of SZY H8, SZY H35, and SZY H36 have been deposited at DDBJ/ENA/GenBank under the accession numbers CP091469, JAHKJY000000000, JAHKJZ000000000, and JAHKKA000000000 (Table 1).

SUPPLEMENTAL MATERIAL
Supplemental material is available online only. SUPPLEMENTAL FILE 1, XLSX file, 1.4 MB. SUPPLEMENTAL FILE 2, PDF file, 13.7 MB.