Rengasvirus, a Circular Replication-Associated Protein-Encoding Single-Stranded DNA Virus-Related Genome That Is a Common Contaminant in Metagenomic Data

We report the genome of a circular Rep-encoding segmented or satellite virus, which we have provisionally named rengasvirus. In metagenomic studies of virus-enriched fractions, rengasvirus was detected widely, including in reagent-negative controls. We thus report this genome to help others recognize a probable contaminating sequence.

Here, we describe a circular replication-associated protein (Rep)-encoding single-stranded (CRESS) DNA virus-related genome discovered in metagenomic data from human subjects and also in negative controls, suggesting that it originates from laboratory reagents. The rengasviral (rengas = Finnish for "ring") sequence was first detected in metagenomic sequence data from bronchoalveolar lavage (BAL) fluid samples from lung transplant recipients (1). Default parameters were used for all software except where otherwise indicated.
Contig building, open reading frame (ORF) prediction, and mapping of reads to contigs were performed using the Sunbeam pipeline version 2.1 (2); candidate CRESS viruses were identified by screening against vFam models for CRESS viral Reps using HMMER version 3 (3,4). We confirmed the circularity of these sequences using PCR by amplifying around the DNA circles. For this, we used divergently oriented "back-to-back" primer pairs and recovered product bands of genome length. This was repeated with two primer sets binding to different locations on the circular DNA (set A forward [Fwd], GGCAGATCTAGATCACTACTCTGGAC; set A reverse [Rev], GCCAATGCGGGAGTAAATAGCTTG; set B Fwd, CCCTATCACTCTATAACATAACAAATGTCATTAGG; set B Rev, GGGTAATACTGATCCTATCACTCCTTTATAAC). We identified 105× coverage of the PCR-confirmed rengasvirus full-genome sequence in the sample in which we initially identified the rengasvirus sequence (SRA run accession number SRR5826708).
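The logic of the back-to-back primer test can be illustrated with a minimal Python sketch. It is not the authors' code: the function names and the 40-nucleotide toy template are hypothetical, and exact primer matching is assumed. The sketch computes the product length expected when a forward and a reverse primer anneal to a circular template. Divergent primers that sit immediately adjacent to each other yield a product of full genome length, which is the result expected only if the template is circular.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an unambiguous DNA sequence."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(complement[base] for base in reversed(seq.upper()))


def find_on_circle(genome: str, site: str) -> int:
    """Find a site on the top strand of a circular genome (-1 if absent)."""
    # Append the first len(site) - 1 bases so that sites spanning the
    # origin of the circular sequence are also found.
    extended = genome + genome[: len(site) - 1]
    return extended.find(site)


def circular_amplicon_length(genome: str, fwd_primer: str, rev_primer: str) -> int:
    """Expected PCR product length on a circular template (exact matches only)."""
    genome = genome.upper()
    n = len(genome)
    fwd_start = find_on_circle(genome, fwd_primer.upper())
    # The reverse primer anneals to the top-strand site that equals its
    # reverse complement and is extended toward lower coordinates.
    rev_site_start = find_on_circle(genome, reverse_complement(rev_primer))
    if fwd_start < 0 or rev_site_start < 0:
        raise ValueError("primer does not match the template exactly")
    rev_site_end = (rev_site_start + len(rev_primer)) % n
    # Walk from the 5' end of the forward primer, in increasing coordinates
    # and wrapping around the circle, to the 5' end of the reverse primer.
    span = (rev_site_end - fwd_start) % n
    return span if span else n  # a span of 0 means one full turn


if __name__ == "__main__":
    # Hypothetical toy template, not the rengasvirus sequence.
    toy_genome = "GATTACAGGCCTTAAGCTAGCATCGATGGTACCTTGACCA"
    fwd = toy_genome[10:18]                      # extends clockwise from position 10
    rev = reverse_complement(toy_genome[2:10])   # sits directly upstream, pointing away
    product = circular_amplicon_length(toy_genome, fwd, rev)
    print(product == len(toy_genome))
```

On a linear template the same divergent pair would point away from each other and give no product, so a genome-length band is evidence of circularity.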
The rengasvirus genome is a 1,045-bp DNA sequence with a GC content of 49.7% that contains a single ORF encoding the Rep (GenBank accession number MW559600). Based on BLASTp searches, the closest reported Rep amino acid sequences were from a circular DNA molecule from a glacial ice core (QGF19362.1), a CRESS virus helicase (AWW06123.1), and a dragonfly larva-associated circular virus (ALE29688.1), with sequence identities of 51.49%, 42.91%, and 41.11%, respectively (online search, February 2021). A maximum-likelihood phylogenetic tree of Rep placed rengasvirus as a member of the CRESSV2 viral cluster (Fig. 1A) (5).
To investigate the prevalence of rengasvirus sequences, we interrogated publicly available metagenomic data sets generated by our lab and by other groups for homologous sequences. Alignments were performed using the hisss pipeline (https://github.com/louiejtaylor/hisss), described in reference 6, which uses grabseqs and sra-tools to access public metagenomic data, Bowtie 2 (option, --very-sensitive-local) to align reads to target genomes, and ggplot2 (R version 3.2.3) to plot the results (7-11). A positive rengasvirus hit in a metagenomic sample was defined as reads aligning to ≥25% of the viral genome; we discussed the rationale for this cutoff for CRESS virus genomes in a previous publication (6). Of the 40 data sets and 3,568 samples queried for sequence homology to the rengasvirus genome, positive hits were detected in 6 data sets, with the percentage of positive samples ranging from 0.70% to 10.9% (Table 1). We identified hits to the rengasvirus genome in various control samples from two different in-house studies, including two buffer-negative controls performed using the All Prep extraction kit (SRA numbers SRR6316280 and SRR6316219) (1) and one water extraction blank using the UltraSens virus kit (SRR7430813) (both kits from Qiagen, Valencia, CA) (12). Few public data sets include sequenced negative controls, precluding a detailed analysis of the origin of this putative genome or segment. However, circular DNAs have previously been identified as contaminants in nucleic acid extraction kit columns (13), representing a potential source for rengasvirus DNA in negative-control samples.
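The positive-hit criterion above amounts to a breadth-of-coverage threshold. The following Python sketch shows one way such a check could be computed from read alignment intervals. It is an illustration only, not the hisss implementation: the function names are hypothetical, intervals are assumed to be 0-based and half-open on a linearized reference, and reads wrapping the origin of the circular genome are not handled.

```python
from typing import Iterable, List, Tuple

Interval = Tuple[int, int]  # (start, end), 0-based, end exclusive


def merge_intervals(intervals: Iterable[Interval]) -> List[Interval]:
    """Merge overlapping or touching alignment intervals."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Extend the previous interval instead of double-counting bases.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def breadth_of_coverage(intervals: Iterable[Interval], genome_length: int) -> float:
    """Fraction of reference positions covered by at least one read."""
    if genome_length <= 0:
        raise ValueError("genome_length must be positive")
    covered = sum(end - start for start, end in merge_intervals(intervals))
    return covered / genome_length


def is_positive_hit(
    intervals: Iterable[Interval],
    genome_length: int,
    min_breadth: float = 0.25,  # the >=25% cutoff stated in the text
) -> bool:
    """Apply the breadth cutoff to one sample's alignments."""
    return breadth_of_coverage(intervals, genome_length) >= min_breadth


if __name__ == "__main__":
    RENGASVIRUS_LENGTH = 1045  # genome length reported in the text
    # Hypothetical alignment intervals for a single sample.
    sample = [(0, 150), (100, 250), (900, 1000)]
    print(is_positive_hit(sample, RENGASVIRUS_LENGTH))
```

A breadth-based cutoff, unlike a raw read count, is not satisfied by many reads piling up on one short region, which is a common pattern for spurious matches.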
The rengasvirus genome described here encodes only a Rep, raising the question of how it becomes encapsidated in viral particles. In one BAL fluid sample containing rengasvirus (SRA number SRR5826708), we also identified another circular DNA, 933 bases in length with a GC content of 41.3%, that encodes a capsid protein (GenBank accession number MW559599). We identified this sequence from metagenomic contigs using a method similar to the initial rengasvirus detection method (described above), except that we used vFam hidden Markov models (HMMs) based on viral capsid proteins instead of Reps (3). For this molecule, we also confirmed circularity using whole-genome PCR with two sets of back-to-back primers as described above (set A Fwd, GCCTCACTTAAATAGATGTTAAGGTATGCAATG; set A Rev, GGCAAGTACTGGTACTGCACC; set B Fwd, GCCATAAGCATTCCGCGTG; set B Rev, GGCGAAGAGGAAGAGGAAGATG). This sequence also contained a DNA stem-loop with some resemblance to that of the rengasvirus Rep-encoding DNA (Fig. 1B). Thus, the two molecules together might constitute a bipartite genome. It is also possible that rengasvirus is a satellite virus relying on capsid and other functions produced by another unknown virus.
In summary, our results indicate that rengasvirus sequences are a common laboratory contaminant and provide an alignment target that can be used for quality control in future metagenome studies.
Data availability. The sequences described above have been deposited in GenBank under the accession numbers MW559599 and MW559600. The sequence data set in which both sequences were originally identified is available under BioProject number PRJNA390659.