Target-enrichment sequencing yields valuable genomic data for challenging-to-culture bacteria of public health importance

Genomic data contribute invaluable information to the epidemiological investigation of pathogens of public health importance. However, whole-genome sequencing (WGS) of bacteria typically relies on culture, which represents a major hurdle for generating such data for a wide range of species for which culture is challenging. In this study, we assessed the use of culture-free target-enrichment sequencing as a method for generating genomic data for two bacterial species: (1) Bacillus anthracis, which causes anthrax in both people and animals and whose culture requires high-level containment facilities; and (2) Mycoplasma amphoriforme , a fastidious emerging human respiratory pathogen. We obtained high-quality genomic data for both species directly from clinical samples, with sufficient coverage (>15×) for confident variant calling over at least 80% of the baited genomes for over two thirds of the samples tested. Higher qPCR cycle threshold (Ct) values (indicative of lower pathogen concentrations in the samples), pooling libraries prior to capture, and lower captured library concentration were all statistically associated with lower capture efficiency. The Ct value had the highest predictive value, explaining 52 % of the variation in capture efficiency. Samples with Ct values ≤30 were over six times more likely to achieve the threshold coverage than those with a Ct > 30. We conclude that target-enrichment sequencing provides a valuable alternative to standard WGS following bacterial culture and creates opportunities for an improved understanding of the epidemiology and evolution of many clinically important pathogens for which culture is challenging.


INTRODUCTION
Genomic data can provide critical information towards understanding bacterial epidemiology and evolution, and thereby help to guide public health policy and infectious disease control [1]. Whole-genome sequence (WGS) data contribute to determining transmission dynamics at various scales [2], and are used to track the emergence and evolution of particular strains or traits [3].

ACCESS
WGS is routinely implemented for the characterization of a wide range of bacteria, but typically involves working from cultured isolates to yield DNA of sufficient quantity and quality. This reliance on culture as a first step represents a major hurdle for generating genomic data for many bacteria of public health importance. For instance, for bacteria with high biosafety requirements, suitable laboratories may not be available. Similar obstacles exist for fastidious bacteria (i.e. due to their complex nutritional requirements and slow growth), rendering their culture for genetic analysis difficult or impossible [4]. Moreover, while sequencing can be performed directly on extracts from clinical samples (e.g. bodily fluids and tissues), pathogen DNA is often present only in low quantity, representing only a fraction of the overall DNA in a sample [5], thus complicating WGS-based analyses.
Hybridization capture followed by sequencing, or 'target-enrichment sequencing' , is one approach that may be able to overcome this obstacle and enable bacterial genomic sequence data to be obtained more effectively from clinical samples [6,7]. In this culture-free method, libraries are prepared from DNA extracted from the sample, after which specific fragments are selectively enriched prior to sequencing. This is accomplished using custom-designed probes, or 'baits' -typically RNA oligonucleotides labelled with biotin that hybridize to the DNA of interest and are then captured by streptavidin-coated magnetic beads, allowing non-target DNA to be removed through wash steps [8]. Baits are able to selectively capture a wide range of fragment sizes [8,9] and also allow for polymorphisms between their sequence and that of the enriched fragment; depending on the stringency of capture, this can be as low as 70 % sequence similarity [8]. When compared to sequencing directly from samples, target-enrichment sequencing has been shown to substantially increase the proportion of on-target reads and sequencing depth [4]. For several bacterial species, this has enabled the reconstruction of whole-genome sequences from relatively small amounts of target DNA from clinical samples, including sputum, vaginal swabs and cerebrospinal fluid, as well as from arthropod vectors [4,[10][11][12]. This approach thus has the potential to facilitate in-depth investigations, including phylogenomics, into otherwise difficult-to-study pathogens.
In this study, we evaluated target-enrichment sequencing as a means of generating genomic data directly from samples containing one of two different difficult-to-culture bacteria: Bacillus anthracis and Mycoplasma amphoriforme. Bacillus anthracis is a Grampositive bacterium that causes anthrax -a neglected bacterial zoonosis that is a threat to livestock and human health in low-andmiddle-income countries (LMICs) [13]. The high virulence of B. anthracis makes it hazardous to culture and therefore requires containment level (CL) 2+ or 3 facilities [14]. Limited high-containment laboratory capacity in LMICs thus represents a barrier to the generation of sequence data for B. anthracis and understanding of its diversity and transmission in endemic settings. Moreover, concerns over the use of B. anthracis as a biological weapon make the shipment of samples for culture elsewhere challenging [15]. Although the true clinical significance of M. amphoriforme is yet to be elucidated, it has been associated with respiratory tract infections in immunocompetent and immunocompromised patients [16][17][18][19][20][21]. While there is great potential to use comparative genomics to gain insights into its pathogenicity and epidemiology, M. amphoriforme requires specialized culture media and conditions; several weeks are required to generate bacterial cultures with suitable cell densities for DNA extraction and not all PCR positive samples yield a positive culture despite high estimated bacterial loads [16,18].
The primary objective of this study was to assess the feasibility of generating sequence data for B. anthracis and M. amphoriforme at coverage suitable for variant calling over the majority of the genome. We also sought to determine the upper pathogen-specific qPCR cycle threshold (Ct) value limit to provide sufficient coverage for accurate variant calling, and the influence of different variables (i.e. Ct, whether libraries were pooled prior to bait capture, and concentration of the captured libraries) on capture efficiency (i.e. the proportion of on-target reads based on reference mapping).

METHODS
Custom myBaits sets were designed for B. anthracis and M. amphoriforme in partnership with Daicel Arbor Biosciences (Ann Arbor, MI, USA). RNA baits of ~80 base pairs (bp) were designed to achieve 50 % overlap (tiling) across targeted regions (i.e. at least two baits covering any given position). For B. anthracis, baits were designed to tile the core chromosomal genome. Publicly

Impact Statement
Sequence data are increasingly relied upon to help understand the epidemiology, evolution and virulence of pathogens. However, for a variety of reasons, many bacterial species remain challenging to culture, precluding genomic data to be obtained using standard methods. Here, we apply a culture-free method -target-enrichment sequencing -to obtain such data from two bacterial species of public health importance: the bacterium that causes anthrax (Bacillus anthracis), and an emerging respiratory pathogen (Mycoplasma amphoriforme). This approach enabled high-coverage sequence data to be obtained for both species. The specific enrichment panels designed for these pathogens will facilitate their further molecular epidemiological study. More broadly, this study highlights the value of this methodology for genomic sequencing of bacteria in situations where they cannot be cultured.
available B. anthracis genomes available at the time of bait design (n=52 ; Table S1, available in the online version of this article) -including genomes from all three major genetic subgroups (clades A, B and C) [22] -were used to generate a core-genome alignment in Parsnp [23]. This alignment was 4.66 MB (89 % of the 5.23 MB Ames Ancestor reference chromosomal genome, NC_007530.2 [24]). Plasmid (PXO1 and PXO2) sequences were not included in bait design because the primary goal was to use the data for phylogenetic inference, for which accessory genome content is unsuited. For M. amphoriforme, bait sequences were designed to cover the pangenome, which was constructed using all M. amphoriforme whole-genome sequences available at the time of the experimental design, and included the A39 reference genome NCTC 11740 (accession number HG937516; 1 029 022 bp) and 18 additional M. amphoriforme genomes [19]. Pairwise comparison of these genomes was then done using Artemis Comparison Tool [25], concatenating any regions of difference to the reference genome of strain A39, resulting in a pangenome of 1 064 467 bp. For both bacterial species, baits that targeted loci with high copy number in the genome were omitted (e.g. rRNA, tRNA), as were simple repeats, which were soft-masked using RepeatMasker [26]. Additionally, in silico analysis was performed to identify and remove any baits that could potentially cross-hybridize with host or non-target bacteria (File S1, Table S2 [18] during two surveillance studies were included. The NPS originated from a longitudinal Streptococcus pneumoniae carriage study on the Thailand-Myanmar border [29], and were collected from 11 participants, each providing between three and six consecutive swabs between 2007-2010. The NPA originated from influenza virus PCR surveillance also on the Thailand-Myanmar border in 2016 and 2017. The Ct values for M. amphoriforme samples ranged from 20 to 40 (median=29). Sample-specific metadata for both bacterial species are available as Supplementary Tables A and B at the University of Glasgow's Enlighten data repository [30].
All DNA extracts were quantified using Qubit dsDNA high sensitivity Assay Kits (Thermo Fisher Scientific). All initial libraries (referred to as 'pre-capture libraries') were prepared using the NEBNext Ultra II FS DNA Library Prep Kit for Illumina with NEBNext Sample Purification Beads (New England BioLabs). This protocol includes a short PCR amplification (five-seven cycles) to add sample-specific barcodes and to increase the total DNA concentration prior to bait capture. We varied the number of cycles across library preparation batches to achieve the recommended minimum DNA input for bait capture (100 ng), whilst aiming to use the minimum number of cycles required (details in File S1).
For bait capture, we followed the myBaits Hybridization Capture for Targeted NGS protocol (Daicel Arbor Biosciences, Manual v4.01). For each bait capture, 7 µl of pre-capture library was used as input. Most libraries were captured individually. For a subset, libraries were pooled before bait hybridization with the goal of reducing the cost of baits per sample; this was done for 34 B. anthracis libraries that spanned a range of Ct values, and eight M. amphoriforme libraries prepared from samples with higher Ct values. As per the myBaits protocol, we aimed to include four libraries per pool; however, in some cases where certain initial library preparations were unsuccessful, only three pre-capture libraries were pooled. For each pool, approximately equal input concentrations of each pre-capture library were included to yield a total input of 125 ng in 7 µl.
Following bait hybridization, the number of cycles in the PCR amplification step -included in the myBaits protocol to yield at least 1.5 ng µl −1 DNA for downstream sequencing -was varied among preparation batches; between 12 and 15 cycles were used for B. anthracis and 12 or 13 cycles for M. amphoriforme samples. Post-amplification clean-up was done by column-based purification using Zymo DNA Clean and Concentrator −5 kit (Cambridge Bioscience, UK), eluting in 12 µl of 0.1× TE buffer. We refer to the final product as 'captured libraries' .
Captured libraries were quality checked using a Bioanalyzer (Agilent) and sequenced at Glasgow Polyomics on an Illumina NextSeq instrument, generating 75 bp paired-end reads. Given differences in genome size, approximately 4 and 1 million reads were targeted for B. anthracis and M. amphoriforme, respectively. Raw reads were trimmed with fastp 0.20.1 with default settings [31] and aligned to the B. anthracis Ames Ancestor reference genome or the M. amphoriforme pangenome, respectively, with bwa-mem v0.7.17-r1188 [32], after which PCR duplicates were removed in GATK [33]. We defined a 'successful' sequencing outcome as a sample achieving >15× coverage (the cut-off for confidently calling a genotype in a haploid organism [34]) over 80 % of the baited genome (i.e. the parts of the core B. anthracis genome or M. amphoriforme pangenome against which baits were designed). Full analysis scripts are available on Github [27].
A binomial generalized linear mixed model (GLMM) was performed in R [35] with glmer in the lme4 library [36] to investigate the association between the capture efficiency -defined as the proportion of mapped reads per sample -and the explanatory variables: Ct values from pathogen-specific qPCR performed on original sample DNA extracts; bacterial species (i.e. B. anthracis or M. amphoriforme); whether or not libraries were pooled; and captured library concentration (ng µl −1 ). All models included an observation level random effect to account for overdispersion [37]. Pairwise interactions between bacterial species and the other variables were also considered. Model simplification was evaluated based on comparison of AIC, pseudo-R2 analysis was performed in the MuMIn library [38], and the final model was plotted in sjPlot [39].
The final model to evaluate variables influencing capture efficiency included the Ct value, pooling libraries before capture (yes/no), and captured library concentration variables, with an observation level random effect (Table S4, File S1). The model indicated that lower Ct is significantly associated with increased capture efficiency (Fig. 2). Captured library concentration was significantly positively associated with capture efficiency (Fig. S1), while pooling libraries prior to bait capture was negatively associated with capture efficiency (Fig. S2). The model accounted for 55 % of the variation in capture efficiency. We found that Ct explained 52 % of the variation in our data. In contrast, captured library concentration explained 13 %, and pooling only 1 %.

DISCUSSION
We have shown that broad, deep sequencing coverage can be obtained for M. amphoriforme and B. anthracis directly from samples using target-enrichment sequencing. These species are representative of the breadth of typical bacterial genome sizes (~1.0 MB and ~5.2 MB, respectively). For both species, we were able to recover a substantial portion of the genome at a sufficient depth of coverage for haploid variant calling where otherwise such data would have been extremely challenging or impossible to obtain. Since total sequencing read amounts are independent of enrichment performance, it is possible that even more of the study samples could have reached our coverage threshold (>15× across 80 % of the baited genome) had we performed additional sequencing to generate more reads per sample. Target-enrichment sequencing will be transformative for the molecular epidemiological study of these two species of public health importance, and more generally provides a valuable alternative when standard WGS following bacterial culture is impracticable [11,12,40,41], so long as a reference sequence is available to facilitate bait design. Culture-free methods such as targeted-sequence capture expand the opportunities for the genomic study of several Hazard Group 2+/3 pathogens, such as B. anthracis, within LMIC countries where they remain a threat to human and animal health. Whereas high-containment facilities are often lacking, sequencing platforms are increasingly available worldwide. Targeted-sequence capture also promises to facilitate our understanding of the genomic epidemiology of M. amphoriforme, which has only been isolated from patients on a few occasions due to its requirement for specific culture conditions and slow growth [19]. It has particular promise for studying pathogenic and antimicrobial resistance determinants, and thus aid timely clinical management of respiratory tract infections caused by this bacterium [17,19,21]. The developed bait sets and methods described in this study will provide a valuable resource for other researchers wishing to pursue the genomic study of these bacteria.
Our results suggest that target-enrichment sequencing is most effective for samples with higher pathogen-specific DNA concentrations, as indicated by Ct value. Ct was strongly correlated with capture efficiency (i.e. a higher proportion of mapped reads, indicative of less off-target capture), as well as the success of sequencing to a sufficient depth for confident genotyping, a finding consistent with previous studies [11,42]. At least two thirds of samples with Ct values ≤30 -indicative of higher starting pathogen concentrations in the samples -were successfully sequenced for both sample sets, whereas this success rate dropped to 10-13 % for samples with Ct >30. Based on previous studies implementing the udg gene as a PCR target for M. amphoriforme [18], a Ct of 30 would correspond to approximately 90-900 genome copies per PCR reaction, suggesting that target-enrichment sequencing will be most useful for pathogens that are present in moderate quantities in clinical samples. Generating genomic data for clinical samples with very low pathogen concentrations thus remains a challenge. While not assessed in this study, including a second round of bait capture (i.e. putting the enriched library through a second round of enrichment) has been suggested to improve sensitivity for such samples [43]. Although capture efficiency was also significantly associated with higher captured library concentration and with libraries not having been pooled prior to capture, the predictive value of these variables was low; Ct value alone could explain most of the variation in the data.
We found that pooling up to four libraries for capture still yielded the targeted coverage in nearly half of the pooled B. anthracis samples, thereby representing potential for even greater cost savings compared to sequencing without enrichment, by significantly reducing the per-sample bait capture reagent costs. The M. amphoriforme samples with libraries pooled in this study had a higher median Ct than unpooled samples (File S1), which likely accounts for the greater difference in capture efficiency observed between pooled vs. unpooled libraries for the two bacterial species (Fig. S2). Previous studies have been more consistent in achieving successful coverage when pooling as many as 20 libraries for Borellia burgdorferi [10] and Treponema pallidum [40]; this success may be attributed in part to carefully selecting libraries for pooling within a narrow range of similar Ct values as performed by Beale et al. [40] and as recommended in later versions of the myBaits manual (v5); therefore higher levels of sample pooling for capture may be feasible for B. anthracis and M. amphoriforme as well.
The ability to generate culture-free genomic data opens significant opportunities for wider phylogenetic studies of these two bacterial species of public health concern, and highlights the potential value of target-enrichment sequencing for facilitating an improved understanding of the epidemiology and evolution of many clinically important pathogens for which genomic data have been limited.