Prediction of Mannheimia haemolytica serotypes based on whole genomic sequences

The aim of the investigation was to predict the serotypes of M. haemolytica based on whole genomic sequences with the capsular gene region as target. A total of 22 strains selected to have been serotyped and to represent all serotypes were investigated by whole genomic sequencing. The BIGSdb (Bacterial Isolate Genome Sequence Database) was downloaded and installed on a Linux server. Here the sequence database was setup with unique loci at serotype level. The server allows serotypes of M. haemolytica to be predicted from whole genomic sequences and the service is available to the public for free from https://ivsmlst.sund.root.ku.dk.


Introduction
Mannheimia haemolytica (Angen et al., 1999a) (formerly known as [Pasteurella] haemolytica) is part of the commensal microflora of ruminants residing in nasopharynx and tonsils (Rice et al., 2007). From here, M. haemolytica can proliferate and start an infection and become responsible for fibrinous pleuropneumonia (Frank, 1988). M. haemolytica is involved in bovine respiratory disease (BRD) (also referred as shipping fever), and BRD is a major problem in feedlot cattle production as it reduces performances and animal welfare (Jeyaseelan et al., 2002;Timsit et al., 2013). M. haemolytica is also an important pathogen in sheep and goats causing septicaemia of young animals and pneumonia of animals at all ages (Gilmour, 1980). Reports of mastitis in sheep caused by M. haemolytica have been published more recently including the potential for horizontal transmission to lambs via suckling (Omaleki et al., 2011(Omaleki et al., , 2016. Since the late 1950s, bacterins + toxoids have been used for prevention of bovine pneumonia caused by M. haemolytica (Mosier et al., 1989). Serotyping is crucial for selection of vaccine strains, as killed vaccines protect mostly against homologous serotypes (Gilmour et al., 1979). A serotyping system based on indirect haemagglutination (IHA) of soluble antigens was developed (Biberstein et al., 1960;Fraser et al., 1983). The serotypes 1, 2, 5, 6-9, 11-14, 16, 17 were found associated with biovar A of [P.] haemolytica (A for acid formation from L-arabinose) while the serotypes 3, 4, 10 and 15 were associated with biovar T (T for acid formation from trehalose) (Biberstein and Gills, 1962;Fraser et al., 1982;Fodor et al., 1988;Younan and Fodor, 1995). Strains of serotypes 1, 2, 5, 6-9, 12-14, and 16 were subsequently included with M. haemolytica as a consequence of its reclassification. Serotypes of M. haemolytica are sometimes designated as A1, A2, etc., as a reminiscence of the combined biovar and serotype designation. Representatives of serotype A11 were included with M. glucosida (Angen et al., 1999a), while members of biovar T were classified as Bibersteinia trehalosi (Blackall et al., 2007).
Serotyping is currently performed by rapid plate or slide agglutination with antisera prepared in rabbits against formalin killed whole cells of the serotype reference strains (Frank and Wessman, 1978;Klima et al., 2011Klima et al., , 2017. A major problem with traditional serotyping is limited access to reference strains and antibodies as well as the presence of not typable isolates (Quirie et al., 1986). Donachie et al. (1984) identified nine serogroups of not typable (UT) strains based on counter current immunoelectrophoresis tests.
Early attempts to identify the serotypes by DNA techniques were performed by gene probing with the leukotoxin gene (Burrows et al., 1992). Later, PCR primers were designed based on comparative genomics and targets were identified in genes encoding for the capsule or LPS biosynthesis or hypothetical genes resulting in a multiplex PCR for molecular serotyping of serotypes 1, 2 and 6 (Klima et al., 2017). However, when we tested this PCR, several cases of cross reactions with the reference serotypes were found (see supplementary Fig. S1 as an example). A fully molecular based system to predict all serotypes still remains to be developed.
The aim of the investigation was to predict the serotypes of M. haemolytica based on the variable central part of the capsular gene region as target (Fig. 1). The serotype prediction was based on whole genomic sequences whose analysis have become affordable for routine identification and typing. The online genomic comparisons and prediction of serotypes are done by BIGSdb (Bacterial Isolate Genome Sequence Database) which is a scalable, open source, web-accessible database system that enables comparisons of phenotypic and sequence data (Jolley and Maiden, 2010). BIGSdb is setup with a sequence database and an isolate database. BIGSdb is written in the computer program Perl for installation on UNIX/Linux operating systems and it utilizes the PostgreSQL database and Apache web server software. Sequence homology matching uses BLAST (basic local alignment search tool) (Altschul et al., 1997).

Selection of strains and DNA sequencing
A total of 22 strains were investigated by whole genomic sequencing (Table 1). They were selected because they were already serotyped and they represented all serotypes. Strains were kept frozen at − 80 • C and cultivated on blood agar with 5 % bovine blood (Blood Agar Base, CM55, Oxoid) to confirm purity. A single colony was used to inoculate Brain Hearth Infusion broth (CM1135, Oxoid) and cultivation was done overnight at 37 • C. DNA was extracted by Maxwell RSC Cultured Cells DNA kit (Promega, Madison, USA). Eight-teen strains were whole genome sequenced by Miseq Illumina (Anicon, Germany; NGS-MiSeq, University of Copenhagen, Denmark) and four by IlluminaHiseq2000 (National High-throughput DNA Sequencing Centre, Copenhagen, Denmark).

Bioinformatical comparisons
Reads were trimmed (0.01 error probability) and assembled in CLC Genomic Workbench v. 20.04 (QIAGEN). To calculate the DNA similarity and estimate the digital DNA-DNA hybridization (dDDH) between draft genomes, the GGDC ver. 2.1 (Genome-to-Genome Distance Calculator) (http://ggdc.dsmz.de/distcalc.php) server was used based on formula 2 (d 4 ) calculating the sum of all identities found in HSPs (high scoring segment pairs) divided by overall HSP length (Auch et al., 2010a,b). Automatic annotation was performed by RAST (rapid annotation using subsystem technology) (http://rast.nmpdr.org/) (Overbeek et al., 2014). BLAST (Altschul et al., 1997) search (version 2.12.0+) performed with blastn and default setting directly on the NCBI server was used for comparison of sequences and to identify target genes in whole genomes. For comparison of target gene sequence to whole genomic sequences, the option "Align two or more sequences" was used with default settings. MLST types were extracted from the genomes by comparison to the multilocus sequence type (MLST) database for M. haemolytica (http://pubmlst.org/mhaemolytica/) (Jolley et al., 2004;Petersen et al., 2009). Multiple alignments and neighbour joining phylogenetic analysis including calculation of bootstrap support were done by ClustalX2 (Larkin et al., 1997), and MEGA7 (Kumar et al., 2016) was used for graphical representation of trees.

Setup of BIGSdb
BIGSdb v. 1.18 was downloaded (https://github.com/kjoll ey/BIGSdb) and installed on Linux server OpenSuse desktop 13.1 × 64bit OS at University of Copenhagen. Sequence handling routines are provided by the BIO PERL library and EMBOSS suite of programs and BLAST v. 1.4.4. Further details are provided online in "About BIGSdb" at ivsmlst.sund.ku.dk. The sequence database was setup with genes being unique for each serotype as loci. Scan for alleles in the isolates database of BIGSdb was based on at least 40 % coverage and 70 % identity with blastn word size of 20 and other parameters as default settings. The server is available to the public and free to use from htt ps://ivsmlst.sund.root.ku.dk.

DNA sequence analysis and MLST comparisons
The assembly of the 22 genomes resulted in lengths of the concatenated sequenced from 2,349,050 to 2,704,499 bases, with between 54 and 301 contigs, N 50 between 12,537 and 123,498, coverages of 45-300 and a GC content of around 41 % (supplementary Table S1). The quality of sequencing and assembly were within thresholds recently defined (Chun et al., 2018;EFSA., 2021).
A correlation between sequence types and serotypes was not found (Table 1). For instance, sequence type 1 included 5 serotypes and serotypes 2, 6 and 9 each included more sequence types. The results confirmed previous results that showed that serotypes could not be predicted based on clonality on the whole genome level for M. haemolytica (Christensen et al., 2020). This result excluded MLST or whole genomic MLST schemes for prediction of serotypes. The ability to predict serotypes of M. haemolytica from whole genomic sequences seemed confined to the capsular gene region.

Analysis of the capsular gene region with respect to serotype prediction
The capsular gene region published by Lo et al. (2001) (GenBank accession number AF170495), for a strain labelled A1, was used as a reference. In this strain, 12 open reading frames (ORF) were identified of which the first four were annotated as cpxABCD (RAST annotation as kpsT, kpsM, kpsE and wcbC) and the last four as wecBC and phyAB (RAST annotated as kpsC and kpsS). Within the central part, ORF1 and ORF2 were annotated with unknown function and ORF3 and ORF4 as UDP-N-acetyl-D-mannosamine and UDP-N-acetylglucosamine 2 epimerase encoding, respectively. Comparison between the genomic sequences of all serotypes showed that cpxABCD and phyAB genes in the capsular encoding regions were homologous in all reference strains. The phylogeny based on the concatenated DNA sequences of the six genes cpxABCD and phyAB showed a separation of all serotypes except for serotype 14 intermingling with the two strains of serotype 9 (Fig. 2). The divergence was related to the phyAB region whereas the cpxABCD 3 allowed a separation of the two serotypes. A total of 39 ortholog coding sequences were identified in the variable region for all strains out of which 15 were predicted as transferases and 10 with other functions whereas 14 were hypothetical proteins ( Supplementary Fig. S2). The ORFs in this variable part were compared pair-wise by BLAST between all reference strains and the ORFs that varied between serotypes were identified ( Supplementary Fig. S2). The gene wecB (UDP-glucose dehydrogenase) was the most conserved gene within the variable region and found in all strains except for strain P895 of serotype 12 and 109/95 of serotype 17. The target genes for prediction of serotypes were selected to be unique for the serotype and the resulting limits of nucleotide identity were more that 99 % identity and 40 % coverage within the serotype and less than 95 % identity and 38 % coverage between the serotypes. The target for serotype 1 was predicted as UDP-N-acetyl-D-mannosamine dehydrogenase whereas targets for serotypes 2, 7, 16 and 17 were predicted as transferases. The targets for serotypes 5, 6, 9, 12, 13 and 14 were hypothetical proteins ( Supplementary Fig. S2). For serotype 8, a unique target could not be identified and instead the phyA gene was selected. The gene was present in all serotypes (Fig. 2), however, sufficient sequence divergence was still present to allow a separation of serotype 8 from the other serotypes. To improve the specific detection of this serotype, the target region was truncated only to include the first 1120 nt of the coding sequence. For serotypes represented by more strains, variation within the target locus was scored as alleles.
The target gene locus of serotype 1 (UDP-N-acetyl-D-mannosamine dehydrogenase) was located next to the targets selected for the PCR by Klima et al. (2017). For serotype 2, the target (unknown sialyltransferase) diverged (J450_07000 in CP006573) compared to the target of Klima et al. (2017) (J450_08930). For serotype 6 the target (uncharacterized ATP-grasp domain) was identical to the one in Klima et al. (2017) (D648_40 in CP004753).
The variable part of the capsular gene region included more gene targets than used in the current scheme and more genes may be included as specific loci for serotypes in the future (see Supplementary Fig. S2). Further investigation can also focus on an alternative strategy for serotype prediction based on a MLST like typing system defining the conserved genes cpxABCD and phyAB of the capular region as loci and scoring a serotype as a sequence type over allelic differences in all of the six loci.

Analysis of published whole genomic sequences of M. haemolytica
A prediction of serotypes of M. haemolytica genomes deposited with NCBI (Sayers et al., 2021) by the web-server at ivs.mlst.sund.ku.dk is shown in Supplementary Table S2. Eighty-two, 76 and 27 strains were predicted as serotypes 1, 2 and 6, respectively and one in each of the serotypes 9, 12 and 14. For serotype 1 with four alleles, variation occurred at six positions. Higher variation was found for serotype 2 where the 9 alleles were related to variations at 12 positions. The three alleles of serotype 12 were related to variations at two positions. The two alleles of serotypes 6 and 9 were related to a single nucleotide difference and the two alleles of serotype 14 related to two mismatches. Allelic differences both resulted in synonymous and non-synonymous substitutions.
Seven strains listed as M. haemolytica in NCBI without serotype annotation could not be typed using the new tool for serotype prediction. Strain NCTC 10208 (accession no. UGPM01) listed with serotype 11 was predicted with low similarity to serotype 12. GGDC prediction of NCTC 10208 showed 71 % dDDH to the type strain of M. glucosida and 59.4 % to M. haemolytica indicating species affiliation with M. glucosida using 70 % DDH as threshold (Wayne et al., 1987). Strain NCTC 10643 (accession no. LR134495) was also labelled as serotype 11 and it was also not typed by the web-browser, however, the dDDH to the type strain of M. haemolytica was 76.8 % with low dDDH to M. glucosida. Another not typable strain NIVED/MH/1 listed as M. haemolytica, had low dDDH to the type strain of M. haemolytica of 59.3 % documenting that it was not part of M. haemolytica which explained that its serotype could not be predicted. Strain 2297 MH was both typed as serotype 1 and 2 with targets on contigs 163 and 187 (accession no. LQDK01), respectively. This divergence seemed related to strain contamination. The reasons for the lack of typing of the remaining three stains (587A, D35, 12540) could not be found. They were related to the type strain of M. haemolytica with 87.6 % or higher dDDH, however, target sequences were not detected in the genomes. The most likely explanation is that they belong to the group of not typable (UT) described by Donachie et al. (1984). It was possible to predict the serotypes of strains published as  Donachie et al. (1984) were predicted as serotype 1 allele 2 (data not shown). Other not typable strains from the study were found to belong to other species than M. haemolytica and their serotype could not be predicted (unpublished data). This result showed that prediction is only possible with strains of M. haemolytica. The current investigation using DNA sequences encoding for loci of the capsular gene region did not show any match to bacteria outside M. haemolytica, however, species level identification is still recommended since serotyping (IHA) of a few strains not belonging to M. haemolytica was reported by Angen et al. (1999b).

Conclusion and further investigations
The web-server created allows for all serotypes of M. haemolytica to be predicted from whole genomic sequences. It is free and available to the public (i.e., open access). To serotype an isolate, it is recommended to first confirm the identification of M. haemolytica by MALDI-TOF or one of the species-specific PCRs published for this purpose (Kuhnert et al., 2012;Loy et al., 2018;Conrad et al., 2020;Pansri et al., 2020). Then, the isolate needs to be whole genome sequenced and after assembly of the sequence, it has to be uploaded to the server for prediction of the serotype. At this step, it is recommended to further confirm the identification of M. haemolytica based on the whole genome sequence comparison using TYGS (type strains genomics server) (Meier-Kolthoff and Göker, 2019) or similar tool. It is possible to improve and further develop the prediction of serotypes based on whole genomic sequences of M. haemolytica related to the scalability of BIGSdb and it is most urgently needed to include more strains especially of serotypes 8, 13, 14, 16 and 17.

Declaration of Competing Interest
We declare that we have no conflict of interest.