
Whole transcriptome analysis to investigate differential gene expression and regulatory adaptation can be carried out on two different technological platforms: probe hybridisation to microarrays or deep sequencing by RNA-seq. Since the platforms differ in genome coverage, sensitivity and cost, robust comparisons are required to determine the platform of choice. Here, we present datasets for the whole transcriptional response of verocytotoxigenic Escherichia coli (VTEC) to spinach, obtained from RNA-seq and microarray platforms, together with a comparison between the datasets (available at ArrayExpress: E-MTAB-3249, E-MTAB-4120, E-MTAB-7441). Crown Copyright © 2018 Published by Elsevier Inc. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).


Data format
Filtered and analysed with statistical tests

Experimental factors
E. coli O157:H7 was grown in minimal medium at 18 °C to mid-log phase and transferred to medium containing 40% (v/v) spinach leaf lysate for 1 h.

Experimental features
Total RNA was extracted using commercial kits. Either a cDNA library was generated with enterobacteria-specific primers and hybridised to a microarray (E. coli v2 array, Agilent), or rRNA was depleted and paired-end cDNA libraries were generated for sequencing on an Illumina HiSeq 2000. A series of statistical analyses was used for comparison between the datasets.

Data source location
James Hutton Institute, Dundee, DD2

Value of the data
Direct comparison between transcriptome platforms can allow for the optimal approach to be chosen.
Microarray platforms can offer a cheap and easy approach to transcriptional analysis for model organisms, such as Escherichia coli O157:H7, but are limited by the probe set and, potentially, by sensitivity. RNA-seq does not have the same limitations but is costlier and requires specific skills for analysis.
There are few published reports that make direct comparisons of the platforms; most adopt either one or the other.
Here, the same sample set was analysed on both the RNA-seq and microarray transcriptome platforms to provide the most robust comparison.
The comparison of the datasets showed a strong correlation between the platforms (R > 0.8), but the presence of outliers highlights differences in their outputs.

Data
The microarray [1] and RNA-seq datasets are available in ArrayExpress. A comparative analysis pipeline (Fig. 1) was implemented for bioinformatics analysis and downstream assessment. The microarray dataset required additional processing steps since it is based on probes rather than genes and contains non-target probes from multiple Escherichia coli isolates. The datasets from each platform required normalisation before the log fold-change (spinach leaf lysate condition relative to the no-plant control condition) could be compared. The Pearson and Kolmogorov-Smirnov tests of the log fold-change datasets indicated strong correlation, although the distribution was tailed outside the range of −2 to +2-fold change (Fig. 2).
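The two statistics named above can be illustrated with a minimal sketch. This is not the authors' script (their analysis was run in R); it is a pure-Python illustration that assumes the logFC values from both platforms have already been matched gene by gene. The function names and the five example values are hypothetical.

```python
import math
from bisect import bisect_right


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def ks_statistic(x, y):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between empirical CDFs."""
    xs, ys = sorted(x), sorted(y)
    d = 0.0
    # The gap between two step functions can only change at observed values.
    for v in xs + ys:
        cdf_x = bisect_right(xs, v) / len(xs)
        cdf_y = bisect_right(ys, v) / len(ys)
        d = max(d, abs(cdf_x - cdf_y))
    return d


# Hypothetical logFC values for five genes measured on both platforms.
microarray_logfc = [-2.1, -0.4, 0.1, 0.9, 2.6]
rnaseq_logfc = [-3.0, -0.5, 0.2, 1.1, 3.4]

r = pearson_r(microarray_logfc, rnaseq_logfc)
d = ks_statistic(microarray_logfc, rnaseq_logfc)
print(f"Pearson r = {r:.3f}, KS D = {d:.3f}")
```

The Pearson coefficient measures how closely the paired values follow a line, whereas the Kolmogorov-Smirnov statistic compares the two overall distributions, so a high r can coexist with differently shaped tails.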

Sample preparations
RNA and cDNA samples from the E. coli O157:H7 Sakai strain were obtained and used for microarray analysis as described previously [1]. The cDNA library for RNA-Seq was constructed from the same RNA samples, using the approach described previously for Xanthomonas [2], and run on an Illumina HiSeq 2000 (Genomic Sequencing Unit, the University of Dundee, Dundee, UK).

Data analysis
All bioinformatics scripts and processes are listed in Table 1 and illustrated in Fig. 1.

Microarray
The published microarray dataset [1] was reanalysed to permit a dataset comparison. Raw data were normalised between samples to ensure consistency across datasets. Probes from each sample were filtered according to the fluorescence of the negative control probes on the corresponding microarray. Only probes that were at least 10% brighter than the 95th percentile of the negative probes were retained for the analysis. Linear modelling was applied to the data using the R (v3.0.2) statistical language in the RStudio software (v0.97.551) [3], utilising the Bioconductor Limma library (v3.18.7) [4] and following the Limma user guide (Script EA 2, Table 1). A dendrogram was produced using the gplots package (v2.12.1) [5].
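The probe filter described above can be sketched as follows. This is an illustration only, not the authors' R code. It assumes that "10% brighter" means a raw-scale signal above 1.1 times the 95th percentile of the negative control probes, and that the percentile is computed by linear interpolation. The function names, probe names and intensities are hypothetical.

```python
import math


def percentile(values, q):
    """Percentile (q in 0..100) using linear interpolation between closest ranks."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lower = math.floor(pos)
    upper = math.ceil(pos)
    frac = pos - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * frac


def filter_probes(probe_signal, negative_signal, q=95.0, margin=0.10):
    """Keep probes whose signal exceeds (1 + margin) x the q-th percentile of negatives."""
    threshold = (1.0 + margin) * percentile(negative_signal, q)
    kept = {name: s for name, s in probe_signal.items() if s > threshold}
    return kept, threshold


# Hypothetical intensities: 100 negative control probes and three target probes.
negative_controls = list(range(1, 101))
probes = {"probe_a": 90.0, "probe_b": 104.0, "probe_c": 150.0}

kept, threshold = filter_probes(probes, negative_controls)
print(sorted(kept), round(threshold, 3))
```

Applying the threshold per sample, as the text describes, would mean calling `filter_probes` once for each array with that array's own negative control signals.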

RNA-seq
FastQC software (v0.10.1) [6] was used to perform a quality check of the raw data, according to the software specifications. The reads were then aligned to the Sakai reference genome (GenBank accession number BA000007) using the Bowtie2 (v2.1.0) aligner [7]. Firstly, an index was built using the Index Builder provided by Bowtie2 (Script EA 3A) and this was then used to align the RNA samples to the reference genome (Script EA 3B). Paired-end and random alignment was performed to compare the two (Script EA 11), producing one .sam file per sample. Reads were then summarised using the featureCounts software [8]. First, an annotation file was generated using the Sakai reference genome (GenBank accession number BA000007) (Script EA 3C). The generated reference file was then used to extract the count information from the aligned .sam files (Script EA 3D). The output was a tab-delimited text file containing the name of the gene, the start and end positions of the gene on the strand, the mRNA count and the strand direction. Files were then imported into RStudio, normalised, and fitted to a linear model using the voom function from the Bioconductor libraries Limma [9] and edgeR (v3.2.4) [10]. The log of the fold change between experiment and control cultures was obtained from this linear model (Script EA 4). Volcano plots were generated (Script EA 12) using the ggplot2 package [11].

Dataset comparisons
E. coli O157:H7 Sakai sequences, containing gene identifiers, were downloaded from the NCBI database (Sakai: NC_002695.1, pOSAK1: NC_002127.1, pO157: NC_002128.1) and converted into the appropriate fasta format using the Sequence Format Converter. A Python (v2.7.6) script [12] was written to convert fasta identifiers into a suitable format for the comparison of microarray against RNA-Seq data (Script EA 5). The re-formatted fasta file was then used to construct a searchable database using the Blast (v2.2.17) [13] tool blastdb (Script EA 3E). The Blastall tool was used to match the microarray probe sequences to this database (Script EA 3F), resulting in a list of microarray probe to ECs number matches. The list of matches was filtered to retain only matches in which the alignment length was greater than 50 and the number of mismatches was lower than or equal to 7 (Script EA 6). Raw data and log-transformed fold-change (logFC) values of normalised microarray and RNA-Seq data were compared and plotted using RStudio (Script EA 12; Fig. 2).
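The filtering of probe-to-gene matches can be sketched as below. This is not Script EA 6 itself. It assumes the BLAST results were written in the standard 12-column tabular format (`-m 8` in blastall), which the text does not state. The function name, probe identifiers and example hits are hypothetical.

```python
import csv
import io

# Column order of standard BLAST tabular output (blastall -m 8).
BLAST_COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
                 "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def filter_hits(handle, min_length=50, max_mismatch=7):
    """Yield (probe_id, gene_id) for hits with length > min_length and mismatches <= max_mismatch."""
    reader = csv.reader(handle, delimiter="\t")
    for row in reader:
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        hit = dict(zip(BLAST_COLUMNS, row))
        if int(hit["length"]) > min_length and int(hit["mismatch"]) <= max_mismatch:
            yield hit["qseqid"], hit["sseqid"]


# Hypothetical tabular hits: only the first satisfies both criteria.
example = (
    "probe_1\tECs0001\t98.33\t60\t1\t0\t1\t60\t100\t159\t1e-25\t111\n"
    "probe_2\tECs0002\t100.00\t45\t0\t0\t1\t45\t200\t244\t1e-18\t89.8\n"
    "probe_3\tECs0003\t86.67\t60\t8\t0\t1\t60\t300\t359\t1e-10\t63.9\n"
)

matches = list(filter_hits(io.StringIO(example)))
print(matches)
```

The resulting probe-to-gene pairs could then be used to join the microarray logFC values to the RNA-Seq logFC values for the same ECs identifiers.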

Fig. 1. Flow diagram of analysis steps and processes. RNA-seq and microarray process steps are shown in blue and green, respectively. Other steps were done in R (pink) or externally (yellow).

Fig. 2. Correlation graph of expression profiles. Correlation of differential gene expression, as log fold-change (logFC), between the RNA-seq and microarray datasets.

Table 1
Analysis steps (EA) and associated scripts and processes.