ConFindr: rapid detection of intraspecies and cross-species contamination in bacterial whole-genome sequence data

Whole-genome sequencing (WGS) of bacterial pathogens is currently widely used to support public-health investigations. The ability to assess WGS data quality is critical to underpin the reliability of downstream analyses. Sequence contamination is a quality issue that could potentially impact WGS-based findings; however, existing tools do not readily identify contamination from closely-related organisms. To address this gap, we have developed a computational pipeline, ConFindr, for detection of intraspecies contamination. ConFindr determines the presence of contaminating sequences based on the identification of multiple alleles of core, single-copy, ribosomal-protein genes in raw sequencing reads. The performance of this tool was assessed using simulated and lab-generated Illumina short-read WGS data with varying levels of contamination (0–20% of reads) and varying genetic distance between the designated target and contaminant strains. Intraspecies and cross-species contamination was reliably detected in datasets containing 5% or more reads from a second, unrelated strain. ConFindr detected intraspecies contamination with higher sensitivity than existing tools, while also being able to automatically detect cross-species contamination with similar sensitivity. The implementation of ConFindr in quality-control pipelines will help to improve the reliability of WGS databases as well as the accuracy of downstream analyses. ConFindr is written in Python, and is freely available under the MIT License at github.com/OLC-Bioinformatics/ConFindr.


INTRODUCTION
Public-health microbiology laboratories increasingly apply bacterial whole-genome sequence (WGS) 29 analyses for pathogen identification, high-resolution typing and risk profiling (Ronholm et   identifying inferior datasets; however, assessing contamination is outside of their current scope. 41 The presence of contamination in WGS data is recognized as an important sequence quality issue contaminants can occur at many stages in the generation of bacterial sequence data. For example, cultures 44 recovered from samples may not be adequately purified, or cross-contamination could occur during preparation of genomic DNA or sequencing-libraries (Merchant et al., 2014). Carryover contamination 46 results from the presence of residual fragments from previous sequencing runs (Souvorov et al., 2018).
While integration of controls can help to identify pervasive contamination issues, they are not effective for 48 the identification of sporadic contamination events. 49 Cross-species contamination in short-read WGS data can be readily identified by taxonomic classifica-50 tion of sequence reads using reference databases ( . Contamination may be indicated by a highly fragmented assembly, or a genome 56 size that is larger than expected (Robertson et al., 2018). However, establishment of appropriate cutoffs 57 requires determination of acceptable ranges within a species, and atypical strains may fall outside of these 58 limits.

59
Intraspecies contamination is far more difficult to detect because read-classification approaches cannot 60 be used. In some studies, the quality of metagenomic or single-cell sequencing data is assessed by single-copy genes (Parks et al., 2015). To our knowledge, there are no tools designed and evaluated 65 specifically for the detection of intraspecies contamination in bacterial-isolate sequence data. 66 We have developed a bioinformatics tool, ConFindr, which can accurately and rapidly identify intra-67 and cross-species contamination based on the analysis of raw sequencing reads. We evaluated the 68 performance of this tool for detecting contamination in Illumina short-read WGS data derived from 69 priority foodborne pathogens Listeria monocytogenes, Salmonella enterica and Shiga-toxin producing 70 Escherichia coli (STEC).   To search for multiple alleles, ConFindr begins by using BBDuk (Bushnell, 2014) to extract reads that 88 are likely part of the rMLST gene set. These baited reads are stringently trimmed, again using BBDuk, 89 and then aligned to the rMLST genes using BBMap (Bushnell, 2014). The resulting BAM file is then 90 parsed in order to find 'Contaminating Single Nucleotide Variants' (cSNVs) -that is, sites in the pileup 91 where more than one base is present. Since all rMLST genes are known to be present as single copies, the 92 occurrence of multi-base sites in the pileup indicates multiple alleles, and therefore contamination. To be 93 called as a cSNV, at least 2 bases of the minor variant with a Phred score of 20 or greater must be present 94 at that site, and at least 5 percent of bases must support the minor variant (though these parameters can be 95 changed by the user). ConFindr determines that a sample is contaminated if multiple genera are found in 96 the Mash screen step or if three or more cSNVs are found. To create in silico datasets, we selected complete assemblies from RefSeq for E. coli, S. enterica, and L. and 20 percent contamination using scripts found at https://github.com/lowandrew/FastQMixer to a total 104 coverage depth of approximately 60X. Five replicates were created for each mixed read set.

105
Test Datasets 106 We generated a test dataset of 48 samples comprised of intra-and cross-species mixes of E. coli, S. 107 enterica, and L. monocytogenes isolates with varying levels of relatedness (Table 1) Table S1.  contamination level was 2.5 percent or greater. We note that this is a fairly conservative cut-off and Identification of contaminating SNVs within rMLST genes using ConFindr 149 As ConFindr is based on finding contaminating SNVs (cSNVs) within the rMLST genes in raw reads, we 150 first looked at the reliability of detection of cSNVs in simulated data with different levels of contamination.

151
Synthetic mutants with 100 to 2000 SNVs relative to reference genomes were generated in silico, and the the minimum number of cSNVs that need to be found before a sample will be considered contaminated.

164
Over 80 percent of the sequence types have 16 or more SNVs relative to others. Therefore, ConFindr 165 should almost always be able to detect contamination between two isolates with different rMLST types.

166
ConFindr detects contamination with more sensitivity that existing tools 167 We compared ConFindr to existing tools capable of detecting contamination, CheckM and Kraken using  (Table 1).

173
ConFindr was more sensitive than either CheckM or Kraken for intraspecies contamination detection  Table S1), while both 177 CheckM and Kraken required either more distance between species or a higher level of contamination for 178 its determination.

180
We assembled the contaminated datasets used in this study to assess metrics such as number of contigs, 181 N50 and total length for contaminated datasets relative to uncontaminated datasets. At 5% contamination, 182 there was an increase in the number of contigs and the total length of the assembly, and a decrease in 183 N50 (     In creating ConFindr, we wanted a tool that would be broadly applicable to the bacterial domain, while 208 also providing enough resolution to detect contamination between closely related isolates. We selected  (Table 4). Typical assembly metrics vary among species and strains, making it difficult to develop robust standards for these metrics. For example, S. enterica genomes tend to have higher N50 250 values and assemble into fewer contigs than E. coli (e.g., Table 2). Ultimately, this variability makes it 251 difficult to develop standard cutoffs that can be integrated into automated tools. 252 We applied ConFindr to the evaluation of 1500 samples in the public SRA repository and identified 253 intraspecies contamination in 5.13% of the samples (Table 3). Notably, intraspecies contamination was 254 more prevalent than cross-species contamination. A recent assessment of 67758 publically-available We gratefully acknowledge Julie Shay, Mohamed Elmufti and Austin Markell for helpful comments on 301 the draft manuscript. We would also like to thank Katie Eloranta and Jennifer Liu for providing sequence 302 data used in Table 4 of the manuscript, and users who have tested ConFindr and suggested improvements 303 and modifications.