UNCROSS: Filtering of high-frequency cross-talk in 16S amplicon reads

Next-generation amplicon sequencing is widely used for surveying biological diversity in applications such as microbial metagenomics, immune system repertoire analysis and targeted tumor sequencing of cancer-associated genes. In such studies, assignment of reads to incorrect samples (cross-talk) is a well-documented problem that is rarely considered in practice. By considering unexpected OTUs in artificial (mock) samples, I estimate that cross-talk occurred for ~2% of the reads in one Illumina GAIIx run and eleven Illumina MiSeq runs targeting 16S ribosomal RNA. I also describe UNCROSS, an algorithm for detecting and filtering cross-talk in OTU tables.


Introduction
Recent examples of next-generation amplicon sequencing experiments include the Human Microbiome Project (HMP Consortium, 2012), an analysis of the response of the human immune system to influenza vaccination (Jiang et al., 2013) and a high-throughput search for known cancer-relevant variants in 16 oncogenes (Hadd et al., 2013). In such studies, samples are multiplexed into a single run by embedding index sequences into amplicons to identify the sample of origin. Index sequences are sometimes called tags or barcodes, but I will avoid the latter terms here as some authors use them to refer to the biological sequence in an amplicon. An index sequence can be annealed to the start of the amplicon (Caporaso et al., 2011;Derakhshani et al., 2016) (single-indexing), while dual-index schemes attach indexes to both the ends of the construct (Kozich et al., 2013;Derakhshani et al., 2016). Previous studies have revealed unexpectedly high rates of cross-talk in both 454 (Carlsen et al., 2012) and Illumina (Kircher et al., 2012;Nelson et al., 2014) data.
Indexing methods designed to mitigate cross-talk have recently been proposed by (Esling et al., 2015) and (Schnell et al., 2015). Here, I investigate cross-talk in reads from one Illumina GAIIx run (Caporaso et al., 2011) and eleven MiSeq paired-end sequencing runs (Kozich et al., 2013) targeting the 16S gene. I describe UNCROSS, an algorithm for crosstalk detection in OTU tables, and show that it successfully identifies ~80% of spurious OTU entries due to cross-talk in these runs.

Results
GAIIx reads were kindly provided by the authors as they are not deposited in the Short Read Archive as stated in Caporaso et al. 2011. They include 25 in vivo samples from different environments and three replicates of a designed (mock) community containing 67 strains. A single-index scheme was used with a 6-base index sequence. I created a partial reference database of 16S sequences for the mock community by matching species names to the Living Tree Project subset of the SILVA database (Yilmaz et al., 2014). These sequences may have some differences compared to strains in the mock samples. I was unable to find reference sequences for nine of the species in the community.
MiSeq reads were obtained from http://www.mothur.org/MiSeqDevelopmentData.html, accessed 10th Jan. 2016. These are paired-end reads from eleven different MiSeq runs using three different versions of the Illumina Real-Time Analysis (RTA) and MiSeq Control Software (MCS) ( Table S1 in Kozich et al., 2013). Twelve samples were sequenced in each run: three replicates of a mock sample with 21 species which was designed (Haas et al., 2011) for the Human Microbiome Project, plus three replicates obtained from human gut, mouse gut and soil samples, respectively. A reference database for the HMP mock community was included in the download. A dual-index scheme was used allowing up to 96 distinct samples.
I created OTUs using UPARSE (Edgar, 2013). MiSeq read pairs were merged using a Bayesian assembler to ensure that consensus base calls and quality scores are correctly calculated in the overlapping segment (Edgar and Flyvbjerg, 2014). Quality filtering was performed using a maximum expected error threshold of one so that the most probable number of errors in each merged read is zero according to its quality scores (Edgar and Flyvbjerg, 2014). Merged reads with lengths <230nt or >270nt were discarded to select the V4 hypervariable region. An OTU table was generated by aligning reads to OTU sequences using USEARCH (Edgar, 2010). A read was assigned to the OTU with highest identity, or discarded if the top hit had <97% identity. For GAIIx reads, sample names were obtained by requiring an identical match to an index sequence. The posted MiSeq reads were already demultiplexed so sample identifiers were taken from the FASTQ filenames; for example, reads in Soil3_S6_L001_R1_001.fastq were assigned to sample Soil3. OTUs were classified by comparing their sequences first to the mock community reference database, then to SILVA (Pruesse et al., 2007) if a match with ≥97% identity was not found.
In all datasets, mock samples were found to have many more OTUs than expected from the designed community composition. In the GAIIx reads, 1,522 OTUs have one or more reads assigned to the mock samples, far more than the ~45 clusters obtained by clustering the known V4 sequences at 97% identity. In the MiSeq data, the runs have up to 727 mock OTUs with nine of the eleven datasets having >200 (Table 3). In all twelve datasets, most of the unexpected mock OTUs (i.e., those which do not match a reference sequence for a designed strain) have high abundances in the environmental samples and most or all these are therefore probably due to cross-talk. Table 1 shows the 25 OTUs from run 130417 with the highest mock abundances. The unexpected mock OTUs (i.e., those which do not match a designed strain) have high abundance elsewhere. For example, OTU EF400979 has 73,265 reads in human gut and 393 reads in the mock samples. Similarly, all of the OTUs with high abundance in the mock samples are often found in low abundance in the environmental samples which is also strongly suggestive of cross-talk, though this is less clear as several of the mock species are human pathogens and thus could plausibly be present in vivo, especially in human gut. In Table 1, OTU table entries were annotated manually as cross-talk, valid, contaminant or overlaid by considering the most likely explanations for the reported counts. An overlaid entry is inferred to be present in both mock and environmental samples. Table 2 shows manual annotations for the 25 most abundant OTUs in the GAIIx data.
Notably, none of the 400 counts in this table are zero despite the different environments and the fact that most OTUs are not expected in the mock samples. This can be explained by observing that a large majority (361/400 = 90%) of the OTU table entries are consistent with cross-talk and most of these should therefore probably be zero. The correct number of non-zero counts in these 25 OTUs is estimated to be approximately 400-361 = 39, an order of magnitude fewer than the 400 obtained without correcting for cross-talk.
To perform manual annotation, I examined each OTU in the table. If the lowest-abundance samples in a given OTU have much lower counts than the high-abundance samples, they are inferred to be probable cross-talk. In a mock sample, a high-abundance unexpected OTU, i.e. an OTU which does not match a species in the designed community, is probably a contaminant. A low-abundance unexpected mock count is probably cross-talk if it is also present in another sample. An alternative explanation is a low-abundance contaminant in the mock sample which is a valid OTU in the environmental samples by coincidence; this is a much less likely explanation. Another possible explanation is contamination which affects multiple samples, e.g. flow-cell residue from previous runs (Nelson et al., 2014); this is also considered to be less likely than cross-talk. Under these assumptions, mock samples enable a more sensitive test for the presence of cross-talk. For example, if an unexpected mock OTU has two reads and some other sample has ten reads then the most likely explanation is cross-talk. The anomalously large cross-talk rate of 2/12 = 17% of the reads can be explained by fluctuations due to sampling effects when there are small total numbers of reads, which can result in high outlier values for some OTUs. In environmental samples, OTUs cannot be considered as expected or unexpected so abundances of two and ten in an OTU with twelve total reads is not a reliable indicator of cross-talk.
The UNCROSS algorithm described below uses simple heuristics to automate the manual procedure described above for annotating cross-talk. UNCROSS-Ref predicts cross-talk in mock samples using a reference database containing only expected sequences in the designed community. UNCROSS-Denovo predicts cross-talk in all samples considering read counts alone without using a database. These approaches are complementary. UNCROSS-Ref can identify unexpected OTUs by comparison with the database and is thus more sensitive to cross-talk in OTUs with low overall abundance, but cannot detect or correct cross-talk in environmental samples. UNCROSS-Denovo is less sensitive to cross-talk in OTUs with low overall read counts, but can detect cross-talk in environmental samples and can thus be used to detect and correct cross-talk in practice.

UNCROSS algorithm
For a given OTU, let a low count be greater than zero and small enough to infer that most or all the reads for this sample are probably due to cross-talk. A high count is large enough to infer that most of the reads were correctly assigned to its sample. An undermined count is too large to be low and too small to be high (Fig. 2). UNCROSS uses simple heuristics to classify counts as low, undetermined or high. For a given OTU, variables are defined as follows.
S is the number of samples.
N is the total number of reads for all samples.
NT is the number of reads which are assigned to the wrong sample.
MT is the mean number of cross-talk reads per sample = NT/S. R is the cross-talk rate = NT/N. nH is the total number of reads in high counts, i.e. an estimate of the total number of reads in valid non-zero entries.
sL is the number of samples with low counts.
nL is the sum of low counts, i.e. an estimate of the total number of reads in counts which are non-zero due to cross-talk.
mL is the largest low count.
mavg is the mean low count = nL/sL. mmax is the maximum number of reads assigned to a mock sample.
nmax is the maximum number of reads assigned to a non-mock sample.
mavg is the mean number of reads assigned to a non-mock sample. fdn = mL/N is the maximum cross-talk frequency estimated by UNCROSS-Denovo. fref = mmax/N is the maximum cross-talk frequency estimated by UNCROSS-Ref.

rref = S mavg/N is the UNCROSS-Ref estimate of the cross-talk rate (R) (calculated only for
OTUs where mock reads are predicted to be due to cross-talk).
rdn is the UNCROSS-Denovo estimate of the cross-talk rate (R) (calculated only for OTUs where cross-talk is predicted).

UNCROSS-Ref algorithm
The UNCROSS-Ref algorithm classifies an OTU as follows. If the total number of reads assigned to mock samples is zero, the OTU is not analyzed. If the sequence matches the reference database for the mock community, the OTU is classified as designed. Otherwise, the mock reads must be due to contamination or cross-talk, which is decided per the following pseudo-code.
Cross-talk endif endif

UNCROSS-Denovo algorithm
The UNCROSS-Denovo algorithm classifies an OTU by considering the counts for each sample (Fig. 2). Non-zero counts are classified per the following rules.
A minimum value v for a valid count is calculated as follows: if N < 10 then v=5; elseif N < 100 then v=N/10; else v=N/50. Thus, for an OTU with at least 100 reads, a count of at least 2% of the reads is classified as valid. The sum of valid counts is V.
A maximum value x for a cross-talk count then is calculated as follows: if V < 10 then x=1; elseif V < 100 then x=V/10 + 1 else x=V/200. Counts which are neither valid nor cross-talk are classified as undetermined.

Discussion
In the data considered here, cross-talk is clearly identifiable in control samples of known composition (so-called mock communities). Unfortunately, mock samples are rarely included in practice. In fact, to the best of my knowledge, the runs analyzed here are the only public datasets where this type of analysis is possible. If cross-talk is present with frequencies comparable to those estimated here, diversity measures may be significantly degraded. Most OTUs assigned to the mock communities were spurious due to cross-talk, inflating OTU "richness" by an order of magnitude (Table 3). Alpha diversity metrics and estimators will be correspondingly inflated. Beta diversity measures will also be overestimated if some samples have a long tail of shared but spurious OTUs which in fact are not present in those samples. These problems may be more serious when samples from distinctly different environments are sequenced in the same run. When samples from similar environments are compared, then cross-talk degrades the ability to make present / absent inferences for OTUs that have strongly varying abundance associated with certain metadata (e.g., before and after treatment with an antibiotic).
The UNCROSS algorithm uses simple heuristics that attempt to distinguish spurious OTU table entries due to cross-talk which should be zero from valid entries with low abundance.
While UNCROSS works quite well on the datasets tested here, it was trained on the same data (because no other candidate training datasets are available, to the best of my knowledge) and it may be less accurate on other datasets. If there are no mock samples, and/or the number of samples is large, then automated de novo cross-talk detection may be more difficult or impossible, noting that cross-talk may have quite different rates and biases in other runs. If there are ~100, then the average OTU entry will have ~1% of the reads which is comparable to the maximum cross-talk frequency observed in the datasets tested here. This implies that cross-talk may be impossible to detect in most OTUs and that even present / absent inference for a given OTU in a given sample may be impossible in many or most cases. In conclusion, cross-talk is a well-documented but often neglected issue that should always be considered when analyzing multiplexed amplicon reads.