The importance of recognizing and reporting sequence database contamination for proteomics

Advances in genome sequencing have made proteomic experiments more successful than ever. However, not all entries in a sequence database are of equal quality. Genome sequences are contaminated more frequently than is admitted. Contamination impacts homology-based proteomic, proteogenomic


Opinion paper
Genome sequencing has exploded, and the wealth of information generated from genome sequences completed and annotated over the last decade has allowed the reconstruction of individual genomes for non-cultivable organisms from metagenomic data, the determination of community diversity and structure from environmental samples [1] and the exponential increase in protein identifications in proteomic and metaproteomic studies [2,3]. The rapid progress in both genomics and proteomics has been lauded and the technical challenges raised, but the time has come to talk about the proverbial elephant: our sequences are contaminated. The objective of this paper is to highlight the need to understand how genomic data are obtained, compiled and archived, and the importance of curated data repositories for proteomics of non-model organisms and metaproteomic surveys.
The presence of contaminant reads and contigs originating from DNA extraneous to the organism of interest comes as no surprise, especially to those who perform the sequencing. The Department of Energy Joint Genome Institute, a popular resource behind many genomes, issues the following warning to users of its Microbial Single Cell Program: "Despite our best efforts, it is likely that there are some contigs in your single cell genome(s) that are from contaminant organisms. Common contaminants that are known to be in the reagents we purchase are Delftia, Pseudomonas, and Ralstonia. Other contaminants that we commonly see are Propionibacterium and Lactobacillus. In addition, there may be contaminants from your particular sample in the form of free DNA that made it into the well along with your single cell. Although we do an automated screen of your data for the known common contaminants and this information is provided to you in the JGI Single-cell Assembly QC report, this data is not removed because this could result in the removal of legitimate, highly conserved genes from your genome" [4]. Contamination has many sources beyond consumables and reagents at the sequencing facility, even when the best laboratory practices are used. Contamination can occur at any point during sample preparation and DNA extraction [5], and even the supposedly pure cultures and individual organisms used to generate the source DNA can prove to be co-cultures or to have symbiotic partners whose DNA finds its way into sequencing reactions [6]. For example, a novel parvovirus-like hybrid virus, discovered and detected in clinical samples, was later demonstrated to have come from infected diatoms used to produce the silica in DNA extraction spin columns [5], and quantitative PCR revealed that a putatively pure culture of the current-producing bacterium Geobacter sulfurreducens DL1 was in fact a co-culture of two G. sulfurreducens strains, despite the use of standard microbiological techniques to maintain pure cultures [6].
When contaminant sequences are not removed, they confound taxonomic identification in metagenomic samples using techniques such as fragment recruitment [7], and they translate to improperly classified protein sequences in databases such as the widely used non-redundant (nr) protein database maintained by the National Center for Biotechnology Information (NCBI). Genome sequence quality control measures exist, but their use is not universal. Standard parameters for sequence quality control include GC content, the expected genome length, the number of reads, and the N50 [8]. Additional algorithms, such as Kontaminant, can look for overlap between newly generated sequences and specified reference contaminants [9]. Thus, myriad criteria and automated programs exist to assist in the proof-reading of genome sequences, but their use is neither mandatory nor consistent. The burden of quality control for sequences in public repositories is generally placed upon those submitting the sequences, but information regarding what measures have been taken, if any, is not transmitted to database end-users. The NCBI Reference Sequence (RefSeq) collection is a manually curated subset of the NCBInr database; however, most of the additional standards focus on genome annotation (see inset), and again a range of tools exists [10]. Furthermore, the status of microbial genomes in the RefSeq database is typically "provisional," meaning, "The RefSeq record has not yet been subject to individual review. The initial sequence-to-gene association has been established by outside collaborators or NCBI staff" [10]. The Universal Protein Resource Knowledgebase (UniProtKB) is similarly based on nucleotide submissions to databases like the NCBI's GenBank [11]. Like RefSeq entries, entries in the SwissProt collection of UniProt (as opposed to TrEMBL, which is not curated) are subject to extensive manual curation, again generally focused on annotation.
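As a rough illustration of two of the standard parameters named above, the following Python sketch computes GC content and contig N50 for a toy assembly. The function names and the example contigs are hypothetical and are not taken from any cited tool; real pipelines typically use established assembly-statistics software.

```python
from typing import Iterable


def gc_content(sequence: str) -> float:
    """Fraction of G and C among unambiguous bases (A, C, G, T)."""
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    return gc / total if total else 0.0


def n50(contig_lengths: Iterable[int]) -> int:
    """Length L such that contigs of length >= L cover at least half the assembly."""
    lengths = sorted(contig_lengths, reverse=True)
    half = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half:
            return length
    return 0  # empty assembly


# Hypothetical toy assembly (not real data).
contigs = ["ATGCGC", "ATAT", "GGGCCCAT", "AT"]
assembly_gc = gc_content("".join(contigs))
assembly_n50 = n50(len(c) for c in contigs)
print(f"GC content: {assembly_gc:.2f}, N50: {assembly_n50} nt")
```

A GC content or assembly size far from that of related genomes, or an unusually low N50, does not prove contamination, but it is a cheap signal that a record deserves a closer look.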
Results of the 2010 NCBI workshop on genome annotation [12]

1. MINIMAL GENOME ANNOTATION SHOULD HAVE
a. rRNAs (5S, 16S, 23S) and corresponding genes with locus tags,
b. tRNAs and corresponding genes with locus tags,
c. protein-coding genes with locus tags (see below) and corresponding CDS.

A non-exhaustive catalog of reliable software, sources, and databases for the production of microbial genome annotation is both a useful community resource that aids in producing high quality genome annotation. A catalog will be reported here.

VALIDATION CHECKS AND ANNOTATION MEASURES.
Validation checks should be done prior to the submission. NCBI has already provided numerous tools to validate and ensure correctness of annotation. Additional checks will be put in place to ensure the minimal standards are met.
With so many quality control standards and methods for meeting them, it is difficult for end-users to evaluate the trustworthiness of database entries, so by default most sequences are given the benefit of the doubt. In the absence of proof of error, most sequences are assumed to be correctly assembled, correctly annotated, and attributed to the correct taxonomic group. However, during the course of an analysis, it sometimes becomes obvious that something is amiss. Notable recent cases include the mis-annotation of a keratin-derived sequence as a plant protein [13] and the mistaken inclusion of bacterial sequences in the genomic sequences of two Caenorhabditis species [14]. A string of such incidents inspired us to perform a cursory survey of the NCBInr database, which gave a glimpse of the extent to which contamination is present. We present the following BLAST alignments as evidence of sequence contamination.
The first taxon that we identified as suspicious was the bacterium Enterococcus gallinarum EGD-AAK12 (taxid: 1357296), which supposedly contains 12,518 genes, 12,300 of which are annotated as coding sequences (CDS). The 11 Mbp genome of this Firmicute of the order Lactobacillales is in the "scaffolds or contigs" status, with 6194 contigs and a contig N50 of 3468 nt. Its proteome is available in the NCBInr but is not included in the RefSeq protein database. Because of the discrepancy between the genome size of this strain and those of other Enterococcus bacteria (generally approximately 3 Mbp), we examined this organism using BLAST queries. We created a proteome fasta file (ORGfasta) using an HTTP request in the form http://www.ncbi.nlm.nih.gov/protein/?term=txid1357296 [Organism:noexp] (for taxid: 1357296) and post-processed it using a Python script to retain only non-redundant sequences. To identify which sequences could be related to a contamination of the proteome by a different organism, the ORGfasta proteome was processed as follows: (i) a first BLASTp search was performed using a small subset of ORGfasta against the NCBInr database at an E-value threshold of 1E-20 to identify the contaminating genus (CONTgenus), (ii) a list of GIs associated with CONTgenus children taxa was compiled using the NCBI taxonomy gi_taxid_prot.dmp file, (iii) fasta entries corresponding to this GI list were retrieved from NCBInr to build a BLAST database (CONTgenus BLASTdb), (iv) a BLASTp search of the ORGfasta proteome against the CONTgenus BLASTdb was performed, and (v) the output BLAST xml was parsed using a Python script to list ORGfasta sequences with hits at an E-value below 1E-20 to any CONTgenus sequence of exactly the same length. A reference proteome (REFfasta) created using a strain of the same genus whose genome is complete, Enterococcus faecium DO, was submitted to the same BLAST search against the CONTgenus BLASTdb.
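The post-processing step that retains only non-redundant sequences can be sketched as follows. The original script is not reproduced in this paper, so this is a minimal illustration that assumes "non-redundant" means collapsing entries with identical amino acid sequences and keeping the first header encountered; the function names and the example records are hypothetical.

```python
import io
from typing import Dict, Iterable, Iterator, List, TextIO, Tuple


def read_fasta(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from a FASTA-formatted text stream."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def non_redundant(records: Iterable[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Keep one record per distinct sequence (first header wins)."""
    first_header: Dict[str, str] = {}
    for header, seq in records:
        first_header.setdefault(seq.upper(), header)
    return [(header, seq) for seq, header in first_header.items()]


# Hypothetical three-entry proteome; p1 and p2 carry the same sequence.
example = io.StringIO(">p1\nMKT\nAAL\n>p2\nMKTAAL\n>p3\nMSSQ\n")
unique_records = non_redundant(read_fasta(example))
print(len(unique_records), "non-redundant sequences")
```

Collapsing on the exact sequence string is the strictest definition of redundancy; a clustering tool with an identity threshold would remove more entries and is a different design choice.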
An Enterococcus BLAST database excluding both the ORGfasta and REFfasta sequences was built, and the ORGfasta and REFfasta sequences were searched against this database, again using BLAST (Table 1). Of the 12,300 predicted E. gallinarum EGD-AAK12 proteins, 3198 were almost identical to a Klebsiella sequence (same length, BLAST E-value <1E−20), whereas for Enterococcus faecium DO, the closest "complete" reference organism, only 103 sequences met these criteria, giving an estimate for the number of false positives. Thus, at least one Klebsiella strain was likely present as a contaminant in the DNA used for sequencing. Only 244 E. gallinarum EGD-AAK12 sequences are almost identical to an Enterococcus sequence (excluding ORGfasta and E. faecium sequences), whereas E. faecium DO yielded 4122 hits. The high representation of species close to E. faecium in the Enterococcus database could explain the number of hits for E. faecium DO, but the low number of hits for E. gallinarum EGD-AAK12 raises questions. Because Klebsiella, a proteobacterium of the order Enterobacteriales, was present in the sample and because the Illumina sequencing approach generates short reads, assembly errors could lead to erroneous chimeric protein sequences in the output, reducing the number of correctly assembled coding sequences and thus the number of sequences shared with other enterococci.
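The "almost identical" criterion used here (a hit of exactly the same length as the query with an E-value below 1E-20) can be illustrated with a short Python sketch that parses BLAST XML output. This is not the script used for the survey; it assumes the classic NCBI BLAST XML layout (Iteration, Hit, and Hsp elements), and the embedded sample records and the function name are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import List

# Hypothetical BLAST XML fragment with three queries (not real results).
SAMPLE_XML = """<BlastOutput><BlastOutput_iterations>
<Iteration>
  <Iteration_query-def>org_prot_1</Iteration_query-def>
  <Iteration_query-len>250</Iteration_query-len>
  <Iteration_hits><Hit><Hit_def>cont_prot_A</Hit_def><Hit_len>250</Hit_len>
    <Hit_hsps><Hsp><Hsp_evalue>1e-150</Hsp_evalue></Hsp></Hit_hsps></Hit></Iteration_hits>
</Iteration>
<Iteration>
  <Iteration_query-def>org_prot_2</Iteration_query-def>
  <Iteration_query-len>300</Iteration_query-len>
  <Iteration_hits><Hit><Hit_def>cont_prot_B</Hit_def><Hit_len>310</Hit_len>
    <Hit_hsps><Hsp><Hsp_evalue>1e-100</Hsp_evalue></Hsp></Hit_hsps></Hit></Iteration_hits>
</Iteration>
<Iteration>
  <Iteration_query-def>org_prot_3</Iteration_query-def>
  <Iteration_query-len>120</Iteration_query-len>
  <Iteration_hits><Hit><Hit_def>cont_prot_C</Hit_def><Hit_len>120</Hit_len>
    <Hit_hsps><Hsp><Hsp_evalue>1e-5</Hsp_evalue></Hsp></Hit_hsps></Hit></Iteration_hits>
</Iteration>
</BlastOutput_iterations></BlastOutput>"""


def near_identical_queries(xml_text: str, max_evalue: float = 1e-20) -> List[str]:
    """List queries with a same-length hit whose best E-value is below max_evalue."""
    root = ET.fromstring(xml_text)
    flagged = []
    for iteration in root.iter("Iteration"):
        query = iteration.findtext("Iteration_query-def")
        query_len = int(iteration.findtext("Iteration_query-len"))
        for hit in iteration.iter("Hit"):
            if int(hit.findtext("Hit_len")) != query_len:
                continue  # length differs: not counted as almost identical
            evalues = [float(e.text) for e in hit.iter("Hsp_evalue")]
            if evalues and min(evalues) < max_evalue:
                flagged.append(query)
                break  # one qualifying hit is enough for this query
    return flagged


suspects = near_identical_queries(SAMPLE_XML)
print(suspects)
```

Running the same filter on the reference proteome, as done above with E. faecium DO, gives a baseline count against which the number of flagged sequences in the suspect proteome can be judged.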
The second dubious taxon selected for investigation was Ceratitis capitata (taxid: 7213), a eukaryote of the order Diptera that is well known for causing extensive damage to fruit crops. This organism was examined in the same way as the first example, and the results are given in Table 2. Another well-known dipteran, Drosophila melanogaster (taxid: 7227, Bioproject 164), was used as the reference proteome. The C. capitata genome is in the "scaffolds or contigs" status, and its proteome is available in both the NCBInr and RefSeq protein databases. We detected 789 C. capitata proteins that were almost identical to sequences from the Escherichia genus (same length, BLAST E-value <1E−20). In comparison, only nine such matches were obtained from the 21,402 non-redundant D. melanogaster sequences. As in the previous example, the number of BLAST hits on a Diptera order database excluding sequences from both proteomes is much higher for D. melanogaster than for C. capitata, which strengthens the hypothesis that taxid: 7213 is contaminated with Escherichia sequences. This contamination could originate from the gut microbiota because Enterobacteriaceae are the dominant bacterial family in the fruit fly gut [15], although the contamination could also have been introduced in the laboratory or at the sequencing facility. While the genome of C. capitata is much larger than that of Escherichia, the presence of multiple enterobacterial sequences may lead to assembly errors and chimeric protein sequences. Proteomic experiments on this species would be even more challenging than on the contaminated bacterium in the first example because of the lack of homologous sequences, Diptera genomes being poorly represented in databases to date. The low level of MS/MS spectra assignments that would likely be obtained from such experiments might at first be attributed to population polymorphisms (genetic diversity) of the C. capitata sample, masking the contamination effect.
These examples serve as reminders that researchers need to be vigilant when using sequence databases. Proteogenomics has been proposed as a routine procedure to augment genome annotation using empirical evidence for valid sequences [16], and proteogenomic studies should be performed for any novel organism belonging to a poorly characterized branch of the Tree of Life. The use of extended databases such as the NCBInr for metaproteomics, comparative proteogenomics, and homology-based proteomics [17] makes it even more crucial to improve the accuracy of this sequence database. Whether discovered through discrepancies in GC content, taxonomy, or BLAST results, contamination is a real problem that can be identified at many stages, including in the course of proteomic studies that occur long after the sequences have been completed and deposited in the database of choice. In such cases, database end-users have the responsibility to report those discrepancies. Databases, especially RefSeq and UniProt, which require a large manual curation component [18], welcome outside contributions to the effort to maintain sequences of the highest quality and correctness. Potential errors in RefSeq can be reported through its website at http://www.ncbi.nlm.nih.gov/projects/RefSeq/update.cgi. Similarly, UniProt welcomes suggestions either through its website at http://www.uniprot.org/contact or by e-mail addressed to help@uniprot.org. While manual curation is a valiant effort, new automated tools are needed to survey existing databases for deviations in quality.