Prospecting for Zoonotic Pathogens by Using Targeted DNA Enrichment

More than 60 zoonoses are linked to small mammals, including some of the most devastating pathogens in human history. Millions of museum-archived tissues are available to understand natural history of those pathogens. Our goal was to maximize the value of museum collections for pathogen-based research by using targeted sequence capture. We generated a probe panel that includes 39,916 80-bp RNA probes targeting 32 pathogen groups, including bacteria, helminths, fungi, and protozoans. Laboratory-generated, mock-control samples showed that we are capable of enriching targeted loci from pathogen DNA 2,882‒6,746-fold. We identified bacterial species in museum-archived samples, including Bartonella, a known human zoonosis. These results showed that probe-based enrichment of pathogens is a highly customizable and efficient method for identifying pathogens from museum-archived tissues.


Panel Development
We developed a set of biotinylated probes for UCE-based, targeted sequencing of 32 pathogen groups (Table 1). For example, given the large evolutionary distances covered by various pathogens, we generated sets of probes that target more discrete taxonomic groups (e.g., Nemotoda, Yersinia). For bacterial pathogens, probes were designed to capture all species within the genus or species group. For eukaryotic pathogens, probes were designed to be effective at taxonomic ranks that ranged from species group to class. The taxonomic rank varied in eukaryotic pathogens based on the following criteria: 1) the number of available genomes, 2) sequence diversity -because this impacted the number of probes needed. Table 1 provides information on the pathogen group, targeted zoonotic agent and zoonoses.
For each group we used the Phyluce package v1.7.0 (1,2); we generated probes to target ≈49 loci using the methods described below. First, we identified orthologous loci between a focal pathogen and the remaining species in the pathogen group. Focal taxa were chosen based on their assembly contiguity or prominence as a zoonotic agent. To do this we downloaded a genome for each species in the pathogen group. Accession numbers for these assemblies are provided in Table 2. Next, we simulated 25x read coverage for each genome using the ART v2016.06.05 (3); read simulator with the following options: art_illumina-paired-len 100-fcov 25-mflen 200-sdev 150 -ir 0.0 -ir2 0.0 -dr 0.0 -dr2 0.0 -qs 100 -qs2 100 -na. Simulated reads from all query taxa were mapped back to a focal taxon with bbmap v38.93 (4); enabling up to 10% sequence divergence (minid = 0.9). Unmapped, or multimapping reads were removed using Bedtools v2.9.2 (5) and phyluce_probe_strip_masked_loci_from_set (filter_mask 25%). The remaining reads were merged to generate a BED file containing orthologous regions between the query and focal taxa.
Then, we identified orthologous loci among all taxa within the pathogen group using phyluce_probe_query_multi_merge_table. Next, we filtered each set of loci to retain only those shared among 33% of taxa in the pathogen group using phyluce_probe_query_multi_merge_table. We extracted 160 bp from each locus and generated an initial set of in silico probes directly from the focal genome using phyluce_probe_get_genome_sequences_from_bed and phyluce_probe_get_tiled_probes.
Additional options for probe design included generating two probes per locus (-two_probes) that overlapped in the middle (-overlap-middle). Focal probes with repetitive regions or skewed GC Page 3 of 8 content (<30% or >70%) content were removed. Next, the probes from the focal taxa were mapped back to each genome in the pathogen group with phyluce_probe_run_multiple_lastzs_sqlite. We used the -identity option to limit searches with a maximum divergence of 30%. Using these results, we extracted 120-bp loci from the probed regions in each representative genome extracted using phyluce_probe_slice_sequence_from_genomes. Theoretically, this dataset should contain orthologous 120-bp sequences from most taxa in each pathogen group. We verified this with phyluce_probe_get_multi_fasta_table, which provides a table showing the number of taxa identified at each locus. We used this information to identify the 100 loci capable of capturing most taxa from the pathogen group. Next, we generated two 80-bp probes from each of the 100bp and 120-bp loci. We used phyluce_probe_easy_lastz to compare the probes to themselves and remove any that were possible duplicates. Then we reduced the probe set even further by clustering probes based on sequence identity with cd-hit-est v4.8.1 (6). We identified sequence clusters with >95% similarity and retained only 1 probe per group. Finally, we recalculated the number of probes needed to capture each locus.
The proceeding steps were repeated for each pathogen group shown in Table 1. To generate a final panel, we selected 49 loci per pathogen group in a way that minimized the number of probes needed. In some cases, we needed to generate 2 sets of probes to adequately represent target pathogens. For example, Kinetoplastea contains 2 pathogens of interest, Trypanosoma and Leishmania. The baits designed for Leishmania were able to target all 49 loci in the most of the Kinetoplastea but only 23 loci in Trypanosoma. We then generated a second set of 617 Trypanosoma-specific baits to augment the kinetoplastid baits and ensure that Trypanosoma parasites were represented adequately in the final panel. Likewise, we doubled the number of baits used to capture loci from the Bacillus cereus group to effective capture B. cereus and B. anthracis. The probe set was quality checked by Arbor Bioscienes. This included comparing the probe set to mammal genomes with blastn v2.12.0 (7) and checking for lowcomplexity sequences. Any probes that failed quality control were replaced before synthesis.

Library Preparation
Standard DNA sequencing libraries were generated from 500 ng of DNA per sample. We Individual samples with similar DNA concentrations were combined together into pools of 4-16 samples and the total volume was reduced to 7 µL with a speedvac vacuum concentrator.
Next, we used the high sensitivity protocol of myBaits v.5 (Daicel Arbor Biosciences) to enrich target pathogen loci from the host/pathogen control and museum archived samples. We used 2 rounds of enrichment for each pool of samples. Probe concentration was 100 ng/µL. Each round was 24 hours at 65°C. After washing of unbound DNA, each library was amplified with a 15cycle PCR amplification step and quantified using qPCR. Finally, the pools of 4-16 were combined into an equimolar pool for sequencing. All sequencing reactions were on single lanes of Illumina Hi-Seq 2500.

Bioinformatic Analyses
All analyses were performed on a single compute node with 48 processors and limited to 100 Gb of RAM. Bioinformatic steps were documented in a series of BASH shell scripts or Jupyter notebooks. These files along with conda environments are available at github.com/nealplatt/pathogen_probes and archived. The basic structure of the bioinformatic analyses are shown in Figure 3. In general, we used the Kraken2 v2. 1.2 (8) to assign a taxonomic id to each read, the Phyluce v1.7.1 (1,2) pipeline to identify, assemble, and align loci, and RaxML-NG v1.0.1 to generate phylogenies from each pathogen group of interest.
First, we used Trimmomatic v0.39 (9) to trim and quality filter low-quality bases and Illumina adapters. Then, we used Kraken2 v2.1.1 (8) to compare each read from our samples to a reduced dataset of target loci using a -conf cutoff of 0.2. We decided to compare our reads to a reduced dataset of target loci to minimize the computational expense of these comparison. To generate the reduced database of bait-targeted loci, we downloaded one representative or reference genome from all species in RefSeq v212 (10) with genome_updater.sh v0.5.1 (https://github.com/pirovc/genome_updater). Then we used BBMap v38.96 (4) to map all the baits to each genome and a kept the 10 best sites that mapped with ≥85% sequence identity.

Page 5 of 8
Next, we extracted these hits along with 1,000 bp up and downstream. These sequences were combined into a single fasta file that should contain the major mapping locations for our baits.
Once reads were classified we identified genera that were known pathogens or were present in at least one sample with more than 1,000 reads. Next, we extracted reads from the relevant family with KrakenTools v1.2 (https://github.com/jenniferlu717/KrakenTools/). These reads were then assembled (Figure 3, panel B) with the SPAdes genome assembler v3.14.1 (11) using the phyluce_assembly_assemblo_spades wrapper script. We filtered out low quality contigs based on size (<100 bp) and median coverage (<10×) as calculated by the SPAdes genome assembler. Next, we filtered individuals even further by removing individuals with fewer <2 contigs.
While we were assembling and filtering contigs from each isolated target loci from species with available genome assemblies, we used genome_updater.sh v0.5.1 (https://github.com/pirovc/genome_updater) to download one (-A 1) reference or representative (-c reference genome, representative genome) genome from either refseq or Genbank (-d refseq,genbank) for the pathogen group. We also included at least 1 individual from an outlier genus to root downstream analyses. These genomes were converted to twobit format with faToTwoBit. Next, we used phyluce_probe_run_multiple_lastzs_sqlite to compare probes from the pathogen group to the genome assemblies with an identity cut off of 85% (-identity 0.85).
These loci plus 1 kb of flanking sequence (-flank 1000) were extracted from the genome using phyluce_probe_slice_sequence_from_genomes. After extraction, the sliced loci were identified and counted using phyluce_assembly_match_contigs_to_probes (-min-identity 90) and phyluce_assembly_get_match_counts. Next, we combined the loci generated from our samples with those from representative and reference genomes and aligned them with phyluce_align_seqcap_align. The resulting alignments were trimmed with gblocks v0.91b (12) and phyluce_align_get_gblocks_trimmed_alignments_from_untrimmed. We then counted the number of taxa per locus alignment (phyluce_align_get_taxon_locus_counts_in_alignments) and removed taxa with fewer than 2 loci (phyluce_align_extract_taxa_from_alignments). Then we removed any loci that contain fewer than half of the expected number of taxa with phyluce_align_get_only_loci_with_min_taxa and concatenated the remaining loci into a single phylip alignment (phyluce_align_concatenate_alignments).

Page 6 of 8
We used RaxML-NG v1.0.1 (13) to generate a maximum-likelihood phylogenetic tree from the concatenated alignment. We ran 100 parsimony tree searches and then another 1,000 replicates using the GTR + G substitution model. Branches with less than 50% support were collapsed with the Newick Utilities v1.6 (14), Newick editor (nw_ed <input_tree_file>'i and b< = 50'). These steps were then repeated with other pathogen groups identified in the samples.

Host Identification
We verified museum identifications by comparing reads to a second Kraken2 v2.1.2 (8) database containing mammalian mitochondrial genomes. To do this, we downloaded all available mammalian mitochondrial genomes (n = 1,651) from https://www.ncbi.nlm.nih.gov/genome/organelle/ (last accessed 3 November 2022). We then created a custom database and compared each of our samples sing Kraken2 and no confidence cutoffs. The Kraken2 classifications were filtered by removing any samples with fewer than 50 classified reads and any single-read, generic classifications.