'Candidatus Ornithobacterium hominis': insights gained from draft genomes obtained from nasopharyngeal swabs

‘Candidatus Ornithobacterium hominis’ represents a new member of the Flavobacteriaceae detected in 16S rRNA gene surveys of people from South-East Asia, Africa and Australia. It frequently colonizes the infant nasopharynx at high proportional abundance, and we demonstrate its presence in 42 % of nasopharyngeal swabs from 12-month-old children in the Maela refugee camp in Thailand. The species, a Gram-negative bacillus, has not yet been cultured, but the cells can be identified in mixed samples by fluorescent hybridization. Here, we report seven genomes assembled from metagenomic data, two to improved draft standard. The genomes are approximately 1.9 Mb, sharing 62 % average amino acid identity with the only other member of the genus, the bird pathogen Ornithobacterium rhinotracheale. The draft genomes encode multiple antibiotic-resistance genes, competition factors, Flavobacterium johnsoniae-like gliding motility genes and a homologue of the Pasteurella multocida mitogenic toxin. Intra- and inter-host genome comparison suggests that colonization with this bacterium is both persistent and strain exclusive.


INTRODUCTION
During previous work on the nasopharyngeal microbiota of children in the Maela refugee camp in Thailand, an abundant unclassified taxon was discovered through 16S rRNA gene sequencing [1]. It was >99 % identical to other unclassified sequences reported in nasopharyngeal samples from the Gambia [2,3], Kenya [4] and Australia [5], and the gene shared 93 % nucleotide identity with that of the avian respiratory pathogen Ornithobacterium rhinotracheale. On the basis of 16S rRNA gene similarity, the taxon was presumed to represent a new species of Flavobacteriaceae, closely related to the genus Ornithobacterium. The taxon was of interest because it was ubiquitous in the study group of 21 children, appearing to be a persistent colonizer and at a proportional abundance up to 71 %. The 16S rRNA gene sequences could be divided into three oligotypes [6]: each appeared to be carried persistently and exclusively by their host [1].
As the bacterium could not be cultured from archived swabs, ten DNA samples from the initial study were selected for metagenomic sequencing to maximise recovery of the genome of interest while representing a range of children, ages and 16S rRNA gene oligotypes: seven were successfully sequenced following multiple displacement amplification (MDA). The extracted genomes were then used to design a PCR-based prevalence screen for samples from the Maela cohort and a fluorescent probe to visualize the cells in mixed samples. On the basis of this genomic analysis, we propose the unclassified taxon as 'Candidatus Ornithobacterium hominis'.

METHODS

Samples
Between 2007 and 2010, a cohort of 955 infants born in the Maela refugee camp on the Thailand-Myanmar border were followed from birth until 24 months of age in a study of pneumococcal colonization and pneumonia epidemiology [7,8]. Pernasal swabs of the posterior wall of the nasopharynx were collected from each infant at monthly intervals using Dacron tipped swabs (Medical Wire and Equipment). Immediately following collection, the nasopharyngeal swab (NPS) samples were placed into STGG (skimmed milk, tryptone, glucose and glycerol) storage medium and frozen at −80 °C within 8 h. An additional sample was taken if the infant was diagnosed with pneumonia according to World Health Organization clinical criteria [9].

DNA extraction and sequencing
Preliminary work was performed on DNA extracted from swabs with a FastDNA spin kit for soil (MP Biomedicals), which was then amplified using MDA. MDA reagents were filtered at 0.2 µm, endonuclease digested with Φ29 enzyme and UV irradiated (254 nm) prior to use to remove any exogenous DNA from the subsequent amplification reaction [10]. A 3 µl sample was heat denatured with 2 µl heat denaturation buffer [20 mM Tris-HCl pH 8.0, 2 mM EDTA and 400 µM PTO random hexamers (Eurofins Genomics)] at 95 °C for 3 min. Reaction master mix (15 µl) [1× RepliPHI reaction buffer, RepliPHI Φ29 enzyme (6.7 U µl⁻¹) (Epicentre), 0.5 mM dNTP, 50 µM PTO random hexamers, 5 % DMSO, 10 mM DTT] was then added, samples were incubated at 30 °C for 16 h, and the reaction was halted by a final incubation at 65 °C for 20 min. Eight separate reaction volumes were processed per sample; these parallel reactions were pooled before sequencing using the MiSeq 250 bp paired-end protocol with a 450 bp library fragment size.

Genome analysis
Samples were selected to maximise recovery of the genome of interest by choosing those with a high proportional abundance of the bacterium from previous 16S rRNA gene sequencing data [1]. Raw reads from MDA samples were first classified using Kraken v. 0.10.6 [11] and parsed to remove any reads classified as mammal, Moraxella, Haemophilus or Streptococcus. The remaining reads were then assembled using SPAdes v. 3.10.0 [12]. Contigs shorter than 500 bp or with mean coverage below 4× were discarded. For the two well-assembled samples, a BLAST+ v. 2.7.0 [13] screen of all contigs against the nr database was used to discard those that closely matched other known nasopharyngeal bacteria. The contigs first brought forward for the draft genomes were those that had consistent, low-identity matches to O. rhinotracheale. Samples were then reciprocally compared using BLAST+ to find further contigs present in all runs. These curated contig sets were manually improved using Gap5 v. 1.2.14 [14] and targeted PCR for gap closure, resulting in two syntenic draft genomes. The other assemblies with large numbers of short contigs were screened by comparison against the draft genomes using BLAST+ and extracting all contigs with >10 % length hit. Automated annotation of curated contigs was performed using Prokka v. 1.11 [15] and the RefSeq database [16]. Average nucleotide identity (ANI) and average amino acid identity (AAI) were calculated using the enveomics calculators [17,18]. ANI used a minimum length 700 bp and minimum identity 70 %, with a 1000 bp fragment window size and 200 bp step. AAI used a 20 % identity cut-off. The reciprocal percentage of conserved proteins (POCP) was calculated as described by Qin et al. [19]. The core and accessory genomes were calculated using Roary v. 3.11.3 [20]. Phylogenetic trees were built using RAxML v. 8.2.8 [21].
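As an illustration of the contig length and coverage cut-offs described above, the following is a minimal Python sketch, not the pipeline used in this study. It assumes SPAdes-style contig headers of the form NODE_1_length_152340_cov_23.7 and treats the cov field as the mean coverage; SPAdes reports this value as k-mer coverage, so its equivalence to the 4× threshold is an assumption. The function names keep_contig and filter_fasta are hypothetical.

```python
import re

# Assumed SPAdes-style header, e.g. "NODE_1_length_152340_cov_23.7"
HEADER = re.compile(r"NODE_\d+_length_(\d+)_cov_([\d.]+)")


def keep_contig(header, min_length=500, min_cov=4.0):
    """Return True if a contig passes the length and coverage cut-offs."""
    match = HEADER.search(header)
    if match is None:
        raise ValueError(f"not a SPAdes-style header: {header!r}")
    length = int(match.group(1))
    coverage = float(match.group(2))
    # Contigs shorter than 500 bp or below 4x coverage are discarded.
    return length >= min_length and coverage >= min_cov


def filter_fasta(lines, **cutoffs):
    """Yield only the FASTA lines belonging to contigs that pass keep_contig."""
    keep = False
    for line in lines:
        if line.startswith(">"):
            keep = keep_contig(line[1:].strip(), **cutoffs)
        if keep:
            yield line


if __name__ == "__main__":
    demo = [
        ">NODE_1_length_1200_cov_10.5\n", "ACGT\n",
        ">NODE_2_length_400_cov_50.0\n", "AAAA\n",   # too short
        ">NODE_3_length_900_cov_3.9\n", "CCCC\n",    # coverage too low
    ]
    print("".join(filter_fasta(demo)), end="")
```

Only the first record of the demonstration input survives both cut-offs.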
Metabolic analysis was performed using KEGG Mapper v.3.1 [22]: annotation with BlastKOALA [23] and the prokaryotic KEGG GENES database, followed by comparison of pathways from Reconstruct Module.

IMPACT STATEMENT
The nasopharynx is part of the respiratory tract and hosts a unique microbial community that is established during infancy and changes throughout life. The nasopharyngeal microbiome is important as it includes bacteria that can cause diseases such as otitis media or pneumonia, as well as non-pathogenic species. In Maela, a refugee camp in Thailand, we identified a prevalent bacterial species colonizing children under the age of 2 years and occasionally their mothers. We were not able to culture it from frozen samples, but could visualize the cells microscopically using a fluorescent probe. Its genetic signature can be seen in published data from several countries, suggesting that the species is globally distributed. From analysis of the genome, we confirm it is highly divergent from its closest characterized relative, the respiratory pathogen Ornithobacterium rhinotracheale, which infects turkeys, chickens and other birds. We propose the name 'Candidatus Ornithobacterium hominis' and describe a screening protocol to detect its presence in samples.

Prevalence screen
Prevalence of carriage in the infant population of Maela was estimated using a quantitative PCR (qPCR) screen direct from NPS storage medium. The 12 months of routine samples from 100 randomly selected infants (excluding twins) were used. The age of child at time of sampling ranged from 359 to 377 days, median 365 days. Concurrent swabs were also acquired from the mothers and screened to assess maternal carriage in relation to infant carriage. qPCR was performed on an Applied Biosystems 7500 RT PCR machine using PowerUp SYBR Green master mix (Applied Biosystems) in 20 µl reaction volumes. The 16S rRNA gene screen targets the V2-V5 region with forward primer 5′-CTTATCGGGAGGATAGCCCG-3′ and reverse 5′-GAAGTTCTTCACCCCGAAAACG-3′, yielding a 700 bp product under the conditions: 94 °C for 5 min (cell lysis); then 40 cycles of 94 °C for 30 s, 53 °C for 30 s, 68 °C for 1 min; ending with a melt curve. A positive result was a cycle threshold (Ct) <40 and peak melting temperature (Tm) of 80-86 °C. The ToxA gene screen with forward primer 5′-TATCTCTCACAGAGCTAGGCTTGAGCGTGG-3′ and reverse 5′-TGCTATATTTGGGAAAGGCGCATGAATACC-3′ yields a 1.95 kb product under the conditions: 94 °C for 5 min (cell lysis); then 40 cycles of 94 °C for 30 s, 58 °C for 2.5 min, 68 °C for 2.5 min; ending with a melt curve. A positive result was Ct <40 and peak Tm of 77-79 °C.
A positive result for both targets was interpreted as carriage-positive and a negative result for both targets as carriage-negative. Non-concordant results (positive/negative or negative/positive) were treated as a separate group. This assessment of carriage prevalence may be affected by several factors: recent antibiotic consumption, low microbial biomass, age under 6 months (as inferred from previous work [1]) or technical error during swab collection may lead to a lower estimate. Presence of dead bacterial cells in the nasopharynx or cross-contamination during sample handling may lead to false positives. The sample size of 100 was selected as adequate to encompass a predicted prevalence of 10-90 % with a precision of 5 % and 95 % confidence interval (CI) [24]. This sample size is approximately one tenth of the total population being estimated, i.e. all 12-month-old children who were born in Maela between 2007 and 2008.
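The interpretation rules for the two-target screen can be sketched in Python as follows. This is an illustrative sketch only: the function names are hypothetical, the Tm ranges are treated as inclusive (an assumption, as the text does not say), and a Ct of None stands for a well with no detected amplification.

```python
# Melt-peak ranges (degrees C) taken from the screen description above.
TM_16S = (80.0, 86.0)
TM_TOXA = (77.0, 79.0)


def target_positive(ct, tm, tm_range, ct_cutoff=40.0):
    """A target is positive if Ct < cut-off and the melt peak is in range.

    ct or tm is None when no amplification or no melt peak was detected.
    """
    if ct is None or tm is None:
        return False
    low, high = tm_range
    return ct < ct_cutoff and low <= tm <= high  # inclusive range: assumption


def carriage_call(ct_16s, tm_16s, ct_toxa, tm_toxa):
    """Combine the 16S rRNA gene and toxA results into a carriage call."""
    pos_16s = target_positive(ct_16s, tm_16s, TM_16S)
    pos_toxa = target_positive(ct_toxa, tm_toxa, TM_TOXA)
    if pos_16s and pos_toxa:
        return "carriage-positive"
    if not pos_16s and not pos_toxa:
        return "carriage-negative"
    return "non-concordant"


if __name__ == "__main__":
    print(carriage_call(28.4, 83.1, 30.2, 78.0))
    print(carriage_call(None, None, None, None))
    print(carriage_call(36.5, 82.0, None, None))
```

Keeping the per-target test separate from the combined call makes it straightforward to report non-concordant samples as their own group, as described above.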

Protein modelling
The 'Candidatus O. hominis' ToxA protein sequence was searched using the program FUGUE [25] against a database of all chains of the Protein Data Bank (PDB) as of June 2017 [26]. Significant similarity with 30 % sequence identity was found for residues 554-1269 to chain X of PDB ID 2EBF [27], which corresponds to the C-terminal region of the Pasteurella mitogenic toxin. The matched region was aligned to the chain sequence using FUGUE, and models were generated with MODELLER v.9.15 using 'very slow' refinement [28]. Visualization of the resulting models was performed on PyMOL Molecular Graphics System v1.8.

Microscopy
A suspension of NPS STGG sample was fixed overnight at 4 °C in 3 % paraformaldehyde, and dehydrated in suspension with 96 % ethanol. Fluorescent hybridization with an Alexa546-labelled probe (Invitrogen) was performed on a fixed sample in buffered suspension (20 mM Tris-HCl, 0.9 M NaCl, 0.1 % SDS) for 2 h at 55 °C, and washed with 20 mM Tris-HCl, 0.9 M NaCl for 5 min at 55 °C. The samples were then suspended in water and applied to standard microscope slides, dried, then incubated in a 300 nM solution of DAPI (4',6-diamidino-2-phenylindole dihydrochloride) in 1× PBS, for 5 min at room temperature. The slides were rinsed with 1× PBS, rinsed again with water, dried and a coverslip applied with ProLong diamond antifade mountant (Invitrogen). The probe with sequence 5′-GUUCUUCACCCCGAAAACG-3′ targets the V5 region of the 16S ribosomal RNA, corresponding approximately to position 822-840 of the 16S rRNA gene of Escherichia coli. It was not found to bind to an O. rhinotracheale sample that was fixed and processed in parallel. A Leica TCS SP8 confocal fluorescence microscope was used to visualize probed cells with laser wavelengths 405 nm and 552 nm. Images were captured and processed via Leica Application Suite X software.

Culture methods
NPS STGG samples were streaked out on 4 % horse blood or chocolate agar (Oxoid blood agar base no. 2; Thermo Fisher Scientific) and incubated at 37 °C in aerobic, enriched CO2 or anaerobic conditions for 48 h. Brain heart infusion broth (25 ml) (Thermo Fisher Scientific), with and without 1 µg ampicillin ml⁻¹, was inoculated with 5 µl STGG medium and incubated either static or shaking at 37 °C for up to 1 week. Other nasopharyngeal species were recovered from NPS STGG samples following these methods, but 'Candidatus O. hominis' was not. The O. rhinotracheale sample used as a negative control for fluorescent hybridization was cultivated on 4 % horse blood agar in microaerobic conditions at 37 °C for 48 h; colonies were then scraped from plates and fixed using the protocol described above.

RESULTS
Genomes of 'Candidatus O. hominis' were assembled from metagenomic data generated on an Illumina MiSeq [Data Citation 1]. Despite significant loss of sequence coverage to human and other bacterial genomes, two samples assembled into 9 and 15 contigs from 'Candidatus O. hominis', yielding draft genomes predicted to be nearly complete based on the detection of all ribosomal protein genes and by intersample comparison. A further five samples assembled into larger numbers of small contigs, which were aligned to the draft genomes and found to cover most of the expected length (Table 1). The 'Candidatus O. hominis' genome is approximately 1.9 Mb, 20 % smaller than its closest relative O. rhinotracheale.

Prevalence
A 'Candidatus O. hominis'-specific real-time PCR detection protocol for V2-V5 of the 16S rRNA gene was designed using full-length gene sequences and tested on NPS STGG samples and on metagenomic DNA of known bacterial composition. This PCR screen was then applied directly to STGG medium from the archived NPS samples of 100 randomly selected 12-month-old infants in Maela, and concurrent swabs from their mothers. A second PCR screen was developed targeting the toxin gene toxA and was also performed on the archived NPS samples. The two PCR targets were concordant in infant samples, resulting in 42 positive and 58 negative results, giving an estimated carriage prevalence among 12-month-old infants in Maela of 42 % (95 % CI: 32.3-51.7). From the mothers, 2 samples were positive, 93 negative and 5 were either equivocal or non-concordant (Table 2). The Ct values were higher in maternal samples than infants, which may indicate a lower bacterial load. Of the 100 infant samples, 12 were also sequenced during the earlier 16S rRNA gene study [1] and 11 of those qPCR results were in agreement with the previous data. One sample with a predicted proportional abundance of 3 % 'Candidatus O. hominis' had a negative qPCR result, while two others of low proportional abundance gave positive qPCR results with Ct values approaching the limit of detection.
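The reported interval for infant carriage is numerically consistent with a normal-approximation (Wald) interval for a proportion. The text does not state which method was used, so the sketch below is an assumption that happens to reproduce the published figures from 42 positives out of 100; the function name wald_interval is hypothetical.

```python
import math


def wald_interval(positives, n, z=1.96):
    """Normal-approximation (Wald) confidence interval for a proportion."""
    p = positives / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    return p - half_width, p + half_width


if __name__ == "__main__":
    low, high = wald_interval(42, 100)
    print(f"{100 * low:.1f}-{100 * high:.1f} %")  # 32.3-51.7 %
```

Other interval methods (for example Wilson or exact binomial) would give slightly different bounds for the same counts.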
Genetic similarity between 'Candidatus O. hominis' and O. rhinotracheale
The position of 'Candidatus O. hominis' in the context of the Flavobacteriaceae, based on 16S rRNA gene sequences, is illustrated in Fig. 1. The POCP [19] may be used to gauge the relatedness of two genomes at the genus level. To be considered conserved for this measure, a gene must share >40 % amino acid identity over >50 % of its length: two members of the same genus are expected to have at least half of their proteins in common. The POCP between UMN-88 and 'Candidatus O. hominis' is approximately 58 %. Although these figures are based on draft genomes, as 50.7 % of UMN-88 proteins are conserved in 'Candidatus O. hominis' from these data, they are likely to be distantly related members of the same genus. The two genomes OH-22767 and OH-22803 have an ANI of 98.78 %, above the 96 % threshold for strains of the same species [30].

Table 2. Results of the qPCR screen for two targets in mother and infant samples.
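The POCP criterion described in this section (>40 % amino acid identity over >50 % of the protein length) can be sketched as below. This is a simplified illustration, not the analysis code used here: it takes the reciprocal form POCP = 100 × (C1 + C2) / (T1 + T2), where C1 and C2 are the conserved protein counts in each genome and T1 and T2 the total protein counts, and it omits the E-value filter that a full implementation would also apply. The input dictionaries, protein names and function names are hypothetical.

```python
def is_conserved(identity_pct, aligned_len, query_len):
    """Conserved if identity > 40 % over more than half of the query length."""
    return identity_pct > 40.0 and aligned_len > 0.5 * query_len


def pocp(best_hits_a, total_a, best_hits_b, total_b):
    """Percentage of conserved proteins between genomes A and B.

    best_hits_x maps each protein of genome X that has a hit in the other
    genome to a tuple (identity_pct, aligned_len, query_len).
    total_x is the total number of proteins in genome X.
    """
    c1 = sum(is_conserved(*hit) for hit in best_hits_a.values())
    c2 = sum(is_conserved(*hit) for hit in best_hits_b.values())
    return 100.0 * (c1 + c2) / (total_a + total_b)


if __name__ == "__main__":
    # Toy data: genome A has 4 proteins, genome B has 6.
    hits_a = {"a1": (62.0, 300, 320), "a2": (45.0, 200, 250), "a3": (35.0, 280, 300)}
    hits_b = {"b1": (62.0, 300, 310), "b2": (45.0, 200, 260),
              "b3": (41.0, 150, 200), "b4": (80.0, 90, 400)}
    print(pocp(hits_a, 4, hits_b, 6))
```

In the toy data, two proteins of A and three of B meet both thresholds, so five of ten proteins are counted as conserved.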

Core genome
The core genes shared between the draft genomes and lower quality assemblies, along with the accessory genes unique to each, were calculated using Roary [20]. Due to the lower quality assemblies containing gaps, just 935 kb or approximately 50 % of the draft genome size was identified as 'core genome' using this analysis. A core genome phylogeny was generated using >13 000 SNPs identified in this shared sequence [31]. Samples taken from the same child at different dates are very similar (Fig. 3), adding to the initial 16S rRNA gene oligotype data that inferred long-term carriage of the same or closely related strains in this cohort. Features of interest that are present in the core genome include a large toxin gene toxA and gliding motility-associated genes.
The 3.8 kb gene toxA is present in all sequenced samples and was detected by PCR in all 16S-positive swabs from children. It is predicted to produce a secreted toxin similar to the Pasteurella mitogenic toxin (PMT). Although 'Candidatus O. hominis' ToxA and PMT share only 35 % amino acid identity overall, there is greater conservation around the predicted active sites, and the modelled structure of the C-terminal domain is extremely similar (Fig. 4).
The core genome includes a full complement of 14 gld genes that are homologues of those required for gliding motility by Flavobacterium johnsoniae. This mechanism involves the movement of an adhesin around the cell membrane in a helical path, thereby pulling the bacterium rapidly along a substrate [32]. Most of these genes in Flavobacterium johnsoniae are also components of the Bacteroidetes type IX secretion system [33,34].

Accessory genome
Around half of the accessory genome is made up of hypothetical protein genes, most of which are similar to those of other members of the Flavobacteriaceae such as O. rhinotracheale, Capnocytophaga, Chryseobacterium, Elizabethkingia, Flavobacterium, Riemerella or Weeksella. It also includes evidence of mobile elements and associated drug-resistance genes, rhs (rearrangement hotspot) genes, and two distinct lipopolysaccharide (LPS) production clusters.
A portion of the hypothetical protein genes encode the Fibrobacter succinogenes major paralogous domain (PF09603). Up to 17 of these genes are present in each genome. A quarter of them are also predicted to possess an immunoglobulin-like fold (IPR013783). No two samples have a complement of completely identical predicted genes, but more are shared between samples that group together based on the core genome (Fig. 3).
The genomes contain different complements of bacteriophage-associated genes, type I, II and III restriction modification systems, diverse variants of DNA degradation genes dndCDE [35], abortive infection systems, and transfer and mobilization genes. In some cases, these are present in small regions of <10 kb and lack any clear structure, while in others they are part of a defined element such as the 30 kb tailed bacteriophage (22803_00899-00942) [Data Citation 3] or the 19 kb element (22803_01685-01710) containing efflux genes and flanked by 350 bp imperfect direct repeats. All sequenced samples possess efflux transporters of the MATE and RND families, but genomes from cluster A (Fig. 3) also have a B1 metallo-β-lactamase (22767_01182) [Data Citation 2], streptogramin lyase (22767_01181) and ampC gene (22767_01179) within a partially-assembled mobile element. Genomes from cluster B possess an extended-spectrum β-lactamase gene per1 on the well-characterized transposon Tn4555 [36].
The accessory genome of strains from cluster A encodes a number of elements containing the conserved RHS repeat-associated core domain (TIGR03696) with extremely variable C-termini and unique hypothetical protein genes immediately downstream. There is one large rhs gene of >9 kb, 22767_01758, that encodes a protein sharing signatures with the Salmonella plasmid virulence protein SpvB (IPR003284) and the bacterial insecticide toxin TcdB (IPR022045), while the other smaller ORFs may be the dissociated tips generated from lateral acquisition of variable C-termini as seen in Serratia marcescens [37]. This large rhs gene with displaced tips is only found in samples from two children, ARI0106 and ARI0073; some but not all of the rhs gene tips differ between them. These samples also possess a further rhs gene with no displaced tips, that has two predicted phospholipase D domains (IPR001736). Some Rhs proteins have been shown to act as competition factors [38], consistent with 'Candidatus O. hominis' being a persistent member of the nasopharyngeal microbiota.
Although all the genomes possess a 30 kb LPS production gene cluster (Fig. 5a-

Fig. 4. [...] is coloured according to amino acid identity to PMT, with identical residues in blue, non-identical in purple and insertions in grey. In PMT, the C1 region is responsible for plasma membrane localization; the C3 region forms an active pocket and is responsible for mitogenic activity in mammalian cells.

successful cultivation, or it may be that the cells have not survived storage due to environmental stresses [39].
In earlier work [1], it was noted that each child was colonized with only one of the three detected 16S rRNA gene oligotypes representing 'Candidatus O. hominis'. Colonization was persistent, i.e. constituting >5 % of the proportional abundance of taxa for at least 5 consecutive months in 13 out of 21 children, but was detected in all children at some point during the study. Here, we describe multiple similar 'Candidatus O. hominis' genomes taken from time-points that were 4-10 months apart in two children, adding evidence to the hypothesis that long-term colonization is restricted to a particular strain for each host. Given the high prevalence of 'Candidatus O. hominis' in Maela and the genetic diversity observed between contemporaneous samples from only four children, this exclusion of diversity from the host may be explained by microbial competition systems, such as the SdpABC-like toxin system identified on a mobile element in OH-22298 [40] or Rhs proteins present in all samples [38]. Due to the high frequency of clinical pneumonia in Maela (0.73 episodes per child year [8]), the children and their microbiota are frequently exposed to β-lactam antibiotics. In all sequenced samples, we found evidence of horizontally acquired drug-resistance genes, which may also aid persistent colonization.
Bacterial LPS may confer advantages in adhesion and avoidance of complement-mediated cell lysis, although it is also a key target for the host immune system [41]. The gene content of LPS cluster variant A is somewhat similar to that of the O. rhinotracheale serotype A (Fig. 5a), the most common of the 18 known serotypes [42].
Gliding motility is often associated with firm dry surfaces [43], though it is also found among oral bacteria such as those from the genera Cytophaga and Capnocytophaga. The ability to move independently in the environment is advantageous for scavenging nutrients, for complex biofilm formation and to bring about contact with other bacterial or host cells. The gld genes that are required for gliding motility among the Flavobacteriia, Cytophagia and Sphingobacteriia overlap with those of the type IX secretion system, so a subset of these genes is also found in non-motile relatives [44]. Despite possessing all 14 gld genes and further required genes sprAET, these 'Candidatus O. hominis' genomes do not include a SprB-like adhesin and putative gliding motility must be confirmed phenotypically.
PMT is a toxin produced by some serovars of Pasteurella multocida that causes a range of host pathologies including nasal bone resorption [45], lower respiratory tract disease [46,47] and dermonecrotic wound infections [48], and has also been shown experimentally to affect the heart [49], liver [50] and bladder [51]. It acts by deamidating the α subunits of several heterotrimeric G proteins, thereby activating mitogenic signalling pathways [52,53]. In the Maela cohort, there are no reports of PMT-like toxin mediated disease, strongly suggesting that despite structural similarities the expression or function of 'Candidatus O. hominis' ToxA is different to that of P. multocida.
In conclusion, we have assembled seven genomes representing a new species of nasopharyngeal bacteria proposed as 'Candidatus O. hominis', from nasopharyngeal swabs collected from a cohort of children in the Maela refugee camp in Thailand. Although phenotypic characterization is not yet possible due to undetermined storage or culture requirements, several points of interest have been identified for further investigation. These include the predicted gliding motility phenotype, two LPS variants and the production of a protein similar to the Pasteurella mitogenic toxin. The prevalence of 'Candidatus O. hominis' colonization appears to be approximately 42 % of 12-month-old children in Maela refugee camp and needs to be estimated in other populations.

Funding information
S.J.S., P.S., A.J.P., A.T., M.C.de G., C.C., J.P. and the sequencing costs were supported by the Wellcome Trust (grant no. 098051). B.O-M. was supported by the Bill and Melinda Gates Foundation (grant no. RG60453). P.T. was funded by Wellcome Trust Clinical Training Fellowship (grant no. 083735/Z/07/Z). Shoklo Malaria Research Unit is part of the Mahidol Oxford University Research Unit supported by the Wellcome Trust of Great Britain. The funders had no role in study design, data collection and analysis, decision to publish or preparation of the manuscript.