Assembly of genomic reads of elite indica rice cultivar onto 2101 reference bacterial genomes for identification of co-sequenced endophytic bacteria

Reference-based assembly of genomic reads of the elite indica rice cultivar RP Bio-226 was carried out against 2101 reference bacterial genomes using the Bowtie2 read alignment tool. Five types of data were collected: the number of paired-end reads aligned concordantly exactly once, the number of paired-end reads aligned concordantly more than once, the number of mates that make up the pairs aligned exactly once, the number of mates that make up the pairs aligned more than once, and the overall percentage of alignment. Interpretation of the results and identification of endophytes based on these alignment statistics are described in detail in our research article “L. Battu, M.M. Reddy, B.S. Goud, K. Ulaganathan, K. Ulaganathan, Genome inside genome: NGS based identification and assembly of endophytic Sphingopyxis granuli and Pseudomonas aeruginosa genomes from rice genomic reads, Genomics (in press)”.


Specifications
These data can be compared with similar alignments of genomic reads from other rice cultivars onto reference bacterial genomes.
The alignment data illustrate a new method of identifying endophytes from plant genomic reads, which can be applied to other plants.

Data
The data presented in this paper are the results, in tabular form, of reference-based assembly of genomic reads of the elite indica rice cultivar RP Bio-226. The tabulated data comprise five types of alignment statistics: the number of paired-end reads aligned concordantly exactly once, the number of paired-end reads aligned concordantly more than once, the number of mates that make up the pairs aligned exactly once, the number of mates that make up the pairs aligned more than once, and the overall percentage of alignment (Supplementary Table 1). Additionally, the genomes that showed maximum alignment with the rice reads with respect to each of these five types of data are tabulated separately (Tables 1-5).
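As an illustration of how the five statistics could be collected, the following minimal sketch parses the paired-end alignment summary that Bowtie2 prints for each run and then selects genomes above a cutoff. It assumes the statistics are read from Bowtie2's standard summary text. The function names, dictionary keys and the `inclusive` option are hypothetical and are not taken from the original analysis. The cutoffs themselves (more than 10,000 reads, at least 0.1% alignment) are the ones stated in the captions of Tables 1-5.

```python
import re

# Patterns follow the layout of Bowtie2's paired-end alignment summary.
# The dictionary keys are illustrative names, not part of Bowtie2.
_PATTERNS = {
    "pairs_concordant_once": r"^\s*(\d+) \([\d.]+%\) aligned concordantly exactly 1 time",
    "pairs_concordant_multi": r"^\s*(\d+) \([\d.]+%\) aligned concordantly >1 times",
    "mates_once": r"^\s*(\d+) \([\d.]+%\) aligned exactly 1 time",
    "mates_multi": r"^\s*(\d+) \([\d.]+%\) aligned >1 times",
    "overall_percent": r"^\s*([\d.]+)% overall alignment rate",
}


def parse_bowtie2_summary(text):
    """Return the five alignment statistics from one Bowtie2 summary."""
    stats = {}
    for key, pattern in _PATTERNS.items():
        match = re.search(pattern, text, flags=re.MULTILINE)
        if match is None:
            raise ValueError(f"summary is missing the line for {key}")
        value = match.group(1)
        # The overall rate is a percentage; the other four are read counts.
        stats[key] = float(value) if key == "overall_percent" else int(value)
    return stats


def select_genomes(stats_by_genome, key, threshold, inclusive=False):
    """Names of genomes whose statistic `key` passes `threshold`, highest first.

    With inclusive=False the value must exceed the threshold; with
    inclusive=True a value equal to the threshold also passes.
    """
    hits = []
    for name, stats in stats_by_genome.items():
        value = stats[key]
        if value > threshold or (inclusive and value == threshold):
            hits.append((name, value))
    hits.sort(key=lambda item: item[1], reverse=True)
    return [name for name, _ in hits]
```

Under these assumptions, a call such as `select_genomes(stats, "pairs_concordant_once", 10000)` would correspond to the criterion of Table 1, and `select_genomes(stats, "overall_percent", 0.1, inclusive=True)` to that of Table 5.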

Experimental design, materials and methods
Total DNA was isolated from the leaves of in vitro grown Oryza sativa indica cultivar RP Bio-226 plants, and a sequencing library was prepared [1]. Whole genome sequencing was carried out with the Illumina NextSeq 500 system (Illumina, San Diego, CA). The raw data files in FASTQ format were used for further analysis. The raw reads were pre-processed with FastQC, and the adapters were removed with the Cutadapt tool [2,3]. After pre-processing, the reads were aligned to the reference genomes using Bowtie2 (ver. 2.2.4) [4]. The reference genomes of 2101 bacterial species, including Pseudomonas aeruginosa PAO1 and Sphingopyxis granuli, were downloaded from NCBI. Reference-based assembly of the reads against these reference genomes involved indexing of the reference genomes, alignment of the reads to the references, and creation of SAM files. SAMtools (ver. 0.1.18) was used for further analysis [5]. SAM files were converted into binary BAM files, sorted and indexed using the 'view', 'sort' and 'index' functions of SAMtools. The consensus sequences were created with SAMtools. Genome annotation was carried out with the RAST and BaySys servers [6,7]. The tRNAs were identified with tRNAscan-SE and the rRNAs with Rfam [8,9].

Table 1 Reference-based assembly of RP Bio-226 genomic reads onto bacterial genomes: list of genomes to which more than 10,000 paired-end reads aligned concordantly exactly once.

Table 2 Reference-based assembly of RP Bio-226 genomic reads onto bacterial genomes: list of genomes to which more than 10,000 paired-end reads aligned concordantly more than once.

Table 3 Reference-based assembly of RP Bio-226 genomic reads onto bacterial genomes: list of genomes to which more than 10,000 mates that make up the pairs aligned concordantly or discordantly exactly once.

Table 4 Reference-based assembly of RP Bio-226 genomic reads onto bacterial genomes: list of genomes to which more than 10,000 mates that make up the pairs aligned concordantly or discordantly more than once.

Table 5 Reference-based assembly of RP Bio-226 genomic reads onto bacterial genomes: list of genomes to which at least 0.1% of paired-end reads aligned concordantly or discordantly.
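
The indexing, alignment and SAM/BAM processing steps described under Experimental design, materials and methods could be scripted per reference genome roughly as follows. This is a sketch only: the exact command lines used in the study are not given, so the file names, the function name and the choice of options are assumptions. The options shown are basic Bowtie2 options and the SAMtools 0.1.x forms of 'view', 'sort' and 'index'. The consensus step is omitted because the text does not specify how it was run.

```python
import shlex


def build_alignment_commands(reference_fasta, reads_1, reads_2, prefix):
    """Command lines for one reference genome; all file names are placeholders."""
    sam = f"{prefix}.sam"
    bam = f"{prefix}.bam"
    sorted_prefix = f"{prefix}.sorted"
    return [
        # Index the reference genome.
        ["bowtie2-build", reference_fasta, prefix],
        # Align the paired-end reads and write a SAM file.
        ["bowtie2", "-x", prefix, "-1", reads_1, "-2", reads_2, "-S", sam],
        # Convert the SAM file to binary BAM.
        ["samtools", "view", "-bS", "-o", bam, sam],
        # Sort the BAM file; SAMtools 0.1.x takes an output prefix here.
        ["samtools", "sort", bam, sorted_prefix],
        # Index the sorted BAM file.
        ["samtools", "index", f"{sorted_prefix}.bam"],
    ]


if __name__ == "__main__":
    # Hypothetical input names, printed rather than executed.
    commands = build_alignment_commands(
        "PAO1.fasta", "reads_R1.fastq", "reads_R2.fastq", "PAO1"
    )
    for command in commands:
        print(shlex.join(command))
```

Building the commands as argument lists keeps the sketch easy to inspect and lets the same function be looped over all 2101 reference genomes; each list can be passed to `subprocess.run` once the tools are installed.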