Complete genome assembly data of Paenibacillus sp. RUD330, a hypothetical symbiont of Euglena gracilis

An unknown bacterial strain was detected in the cytostome and on the cell surface of Euglena gracilis using transmission electron microscopy. To identify the bacterium and its function, we performed isolation experiments. Here we present the genome sequence of the isolate, which was identified as Paenibacillus sp. The genome was sequenced four times: with Illumina paired-end reads, with Illumina mate-pair reads from two libraries (inserts of 3–4 and 6–8 kb), and with long Nanopore reads (tens of thousands of nucleotides). Assemblies based on Illumina reads, including mate-pair reads, could not resolve issues caused by long tandem copies of rRNA genes, other tandem repeats, and extremely GC-rich regions (90–100%). Only long Nanopore reads resolved those gaps and made it possible to complete the entire genome; moreover, we found one plasmid. The length of the genome is 5.56 Mbp, and the average GC content is 59%. The genome of Paenibacillus sp. RUD330 includes 8 copies of each rRNA gene (23S, 16S, 5S), and the length of the plasmid is 8.3 kb. We hope that our genome assembly and the methods used can help other investigators in the assembly of complex genomes. Our reliable assembly could be a good basis for further physiological and genetic engineering studies of similar strains.

Keywords: Paenibacillus; Illumina; Nanopore; NGS sequencing; Genome assembly
© 2020 The Authors. Published by Elsevier Inc. This is an open access article under the CC BY license.

Value of the data
• Our reliable assembly could be a good basis for physiological, phylogenetic, and genetic engineering studies. The description of the methods used can help in the assembly of complex genomes.
• The data provided in this article could be useful for microbiologists, geneticists, genetic engineers, and ecologists.
• The assembled genome can be used to search for particular genes and transcription factors, for transcriptomic investigations, and for strain and species comparisons.
• We describe the challenges encountered in the assembly of this genome, and we hope that our solutions will help researchers facing the same problems.

Data description
A bacterial strain was detected in the cytostome and on the cell surface of Euglena gracilis using transmission electron microscopy. The environmental interactions between E. gracilis and the bacterium were unclear. To identify the bacterium and its function, we performed isolation and sequencing experiments. The assembly of the complete genome met serious challenges: long and short tandem repeats and regions with high GC content. Several sequencing technologies were used to complete the genome. Here we present the genome sequence of the isolate, which was identified as Paenibacillus sp. The PCR product for 16S rRNA obtained from the strain and from the Euglena gracilis culture was homogeneous in sequence and was 100% identical to that of Paenibacillus humicus by BLAST.
No single tool gave an ideal assembly from Illumina reads in terms of both N50 and number of mis-assemblies (Table 1, metrics of alternative draft assemblies), so the assemblies were used to verify one another. Assemblies based on Illumina reads, including mate-pair reads, could not resolve issues caused by long tandem copies of rRNA genes and extremely GC-rich regions (90-100%). Long Nanopore reads resolved those gaps and made it possible to complete the entire genome. Nanopore sequencing confirmed the correctness of the scaffold assembly and clarified the sequences of tandem repeats; moreover, we found one plasmid. Issues and ambiguities are shown in Supplementary Table S1. The length of the Paenibacillus sp. RUD330 genome is 5.56 Mbp, and the average GC content is 59%. The mean coverage of the genome by the reads of the three Illumina libraries was 467×; by Nanopore reads it was 209×. The length of the plasmid is 8.3 kb, and its coverage by Nanopore reads is 429×. We suppose that it is a two-copy plasmid, because this coverage is about twice that of the chromosome.
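The two-copy estimate can be checked from the Nanopore coverage figures reported above. The following is a minimal sketch, assuming that copy number per chromosome is approximated by the ratio of mean plasmid coverage to mean chromosome coverage; the function name `relative_copy_number` is hypothetical and is not part of the pipeline described in this article.

```python
def relative_copy_number(plasmid_coverage: float, chromosome_coverage: float) -> float:
    """Approximate plasmid copies per chromosome as a ratio of mean coverages.

    Assumption: reads are sampled uniformly from both replicons, so the
    coverage ratio reflects relative molar abundance.
    """
    if chromosome_coverage <= 0:
        raise ValueError("chromosome coverage must be positive")
    return plasmid_coverage / chromosome_coverage


if __name__ == "__main__":
    # Nanopore coverage values taken from the text: plasmid 429, chromosome 209.
    ratio = relative_copy_number(429, 209)
    print(f"coverage ratio: {ratio:.2f}, nearest whole copy number: {round(ratio)}")
```

A ratio close to an integer supports that copy number, but library preparation can bias the recovery of small circular molecules, so the value is best read as an estimate.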

Species identity of Euglena
The species identity of Euglena gracilis was confirmed using PCR and sequencing of mitochondrial COI and COII and chloroplast PsaB and RbcL.

DNA isolation, libraries preparation, sequencing
DNA was isolated using the DIAtom DNAprep 100 kit (Izogen, Moscow). The sequencing library with an insert size of 300-400 bp was prepared using the TruSeq DNA sample preparation kit (Illumina, USA) after ultrasonic fragmentation of genomic DNA with a Covaris S220. Two mate pair libraries with insert size ranges of 3000-4000 and 6000-8000 bp were created with the Nextera mate pair sample preparation kit (Illumina). The libraries were sequenced on an Illumina HiSeq 2000, generating paired-end reads of 100 nt.
The library for Nanopore sequencing was prepared from non-fragmented total genomic DNA using the NEBNext Ultra II DNA library kit (NEB, UK) and the Ligation Sequencing kit 1D (Oxford Nanopore Technologies, UK), barcoded using the Native barcoding kit, and sequenced on a MinION with an R10 flowcell (Oxford Nanopore Technologies, UK).
Nanopore reads with an average Phred quality score lower than 7 were discarded by Guppy 3.4.3 [3].
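To illustrate what a mean-quality cutoff of this kind does, here is a minimal sketch that is not Guppy's code. It assumes Phred+33 FASTQ encoding and assumes that the read-level score is obtained by averaging per-base error probabilities rather than the Phred values themselves; the source does not describe how Guppy computes the score, and the function names are hypothetical.

```python
import math


def mean_read_quality(quality_string: str, offset: int = 33) -> float:
    """Mean Phred score of one read, averaged over error probabilities.

    Assumption: qualities are Phred+33 encoded, as in standard FASTQ.
    """
    if not quality_string:
        raise ValueError("empty quality string")
    # Convert each Phred score Q to an error probability 10^(-Q/10).
    error_probs = [10 ** (-(ord(ch) - offset) / 10) for ch in quality_string]
    mean_error = sum(error_probs) / len(error_probs)
    # Convert the mean error probability back to the Phred scale.
    return -10 * math.log10(mean_error)


def passes_filter(quality_string: str, min_quality: float = 7.0) -> bool:
    """True if the read would be kept under the given mean-quality cutoff."""
    return mean_read_quality(quality_string) >= min_quality


if __name__ == "__main__":
    for quals in ("IIII", "####", "I#"):
        print(quals, round(mean_read_quality(quals), 2), passes_filter(quals))
```

Averaging in probability space lets a few very poor bases pull the score down sharply: in the last example the plain average of the two Phred values would be 21, yet the read falls below the cutoff.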
The circularity of the final assembly and the absence of genomic regions that could have been tandemly duplicated or lost due to mis-assemblies were confirmed by mapping mate pair reads; we also checked for the absence of regions where the insert size of mate pair reads deviated from the average. To do this, the reads of the mate pair library with the larger insert size (6-8 kb) were mapped to the genome by CLC Assembly Cell 4.2 (www.clcbio.com), with the options set to map reads over their full length and without mismatches. The average insert sizes over all genome positions were visualized as a graph. Visual inspection indicated no regions with abrupt changes (more than 500 nt) in average insert size, which suggests that there were no mis-assemblies that resulted in large insertions or deletions.
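The insert-size check described above was done by visual inspection of a graph. A programmatic version could look like the following minimal sketch, which assumes the mapped mate pairs are already available as (start, end) outer coordinates on a linearized genome. The function names, the sampling step, and the comparison against the overall mean are illustrative assumptions, not the procedure used by CLC Assembly Cell.

```python
from statistics import mean


def average_insert_by_position(pairs, genome_length, step=1000):
    """Mean insert size of the mate pairs spanning each sampled position.

    pairs: list of (start, end) outer coordinates, 0-based, end exclusive.
    Pairs spanning the origin of a circular genome are not handled here.
    """
    profile = {}
    for pos in range(0, genome_length, step):
        sizes = [end - start for start, end in pairs if start <= pos < end]
        if sizes:  # skip positions that no pair spans
            profile[pos] = mean(sizes)
    return profile


def flag_deviations(profile, threshold=500):
    """Sampled positions whose mean insert size differs from the overall
    mean of the profile by more than `threshold` nucleotides."""
    if not profile:
        return []
    overall = mean(list(profile.values()))
    return [pos for pos in sorted(profile) if abs(profile[pos] - overall) > threshold]


if __name__ == "__main__":
    # Toy data: uniform 7000-nt inserts, so nothing should be flagged.
    toy_pairs = [(s, s + 7000) for s in range(0, 40000, 500)]
    toy_profile = average_insert_by_position(toy_pairs, genome_length=47000)
    print(flag_deviations(toy_profile))
```

The scan is quadratic in the worst case, which is acceptable for a sketch; on a real data set an interval index or a coverage-style sweep would be the more practical choice.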

Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships which have, or could be perceived to have, influenced the work reported in this article.