Short-read fastA files dataset from complexity-reduced genotyping by sequencing data of bacterial isolates from a public hospital in Australia

This data article contains short-read sequences (length 30–69 bp) obtained from complexity-reduced genotyping by sequencing (GBS) of 165 bacterial isolates collected from hospital patients in the Australian Capital Territory between 2013 and 2015. The isolates represented 14 bacterial species. The data are provided as filtered fastA files obtained from an Illumina HiSeq 2500 sequencer. Three complexity-reduction methods were used, each based on a different combination of restriction enzymes: PstI with MseI, PstI with HpaII and MseI with HpaII.

The dataset includes fastA files for the certified reference genomic DNA of Escherichia coli O157 (EDL 933) [1] (IRMM449 Sigma-Aldrich certified reference standard; GenBank accession number AE005174.2; genome size 5,639,399 bp [2]). The short-read fastA files were generated as follows: all samples were processed using the three combinations of restriction enzymes, PstI-HpaII, PstI-MseI and MseI-HpaII. The fastA file identification numbers for each complexity-reduction method are shown in Table 1 and Table 2. Data can be accessed using the link above. Each folder contains a Sample_info.csv file that lists the sample name corresponding to each fastA file for each method.
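As a quick sanity check after download, the read lengths in a fastA file can be compared against the stated 30–69 bp range. The following is a minimal sketch, not part of the original pipeline; the function names (read_fasta, length_summary) and the toy records are hypothetical and chosen for illustration only.

```python
from collections import Counter
from io import StringIO


def read_fasta(handle):
    """Yield (header, sequence) pairs from a fastA-formatted text handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def length_summary(handle, min_len=30, max_len=69):
    """Count reads per length and the number of reads outside [min_len, max_len]."""
    lengths = Counter()
    out_of_range = 0
    for _, seq in read_fasta(handle):
        lengths[len(seq)] += 1
        if not (min_len <= len(seq) <= max_len):
            out_of_range += 1
    return lengths, out_of_range


# Toy input standing in for one downloaded fastA file (made-up reads).
example = StringIO(
    ">read1\n" + "A" * 30 + "\n>read2\n" + "C" * 69 + "\n>read3\n" + "G" * 12 + "\n"
)
lengths, bad = length_summary(example)
print(dict(lengths), "reads outside 30-69 bp:", bad)
```

For a real file, the StringIO object would be replaced by an opened file handle, for example `open(path)` with a path taken from the downloaded folder.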

Bacterial strains
A total of 165 bacterial isolates were obtained from hospital patients and environmental samples in the Australian Capital Territory between 2013 and 2015. The isolates were provided by the Microbiology Department of Canberra Public Hospital and represented 14 bacterial species. DNA was extracted from all bacterial isolates using a chloroform-isoamyl alcohol method [1].

Library preparation and sequencing
Library preparation followed the DArTseq™ (Canberra, Australia) methods, in which the DNA was digested with pairs of restriction enzymes. The restriction enzymes PstI (5′-CTGCA|G-3′), MseI (5′-TTA|A-3′) and HpaII (5′-CCG|G-3′) were used in combination: PstI with MseI, PstI with HpaII and MseI with HpaII. Bacterial isolates were sequenced to approximately 100,000–150,000 reads per sample. Clustering was done according to Illumina (San Diego, CA, USA) protocols using a HiSeq SR Cluster Kit V4 recipe v9.0 and HiSeq SR Flow Cell v4. For sequencing, the Flow Cell was loaded according to the Illumina protocols on a HiSeq 2500 sequencer, using HiSeq SBS kit v4 for a total of 77 cycles [3].
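The effect of digesting with a pair of restriction enzymes can be illustrated with a small in silico digest. This is a hedged sketch, not the DArTseq™ procedure: it assumes, for illustration only, that the fragments of interest are those bounded by one cut from each enzyme, it considers a single strand, and it takes the cut positions from the notation used in the text above (the supplier documentation should be consulted for exact cleavage positions on each strand). The names ENZYMES, cut_positions and double_digest, and the toy sequence, are hypothetical.

```python
import re

# Recognition sites with '|' marking the cut position, copied from the text.
ENZYMES = {"PstI": "CTGCA|G", "MseI": "TTA|A", "HpaII": "CCG|G"}


def cut_positions(seq, site_with_cut):
    """Return 0-based cut positions of one enzyme on a single strand."""
    offset = site_with_cut.index("|")
    site = site_with_cut.replace("|", "")
    # A lookahead is used so that overlapping sites are also reported.
    return [m.start() + offset for m in re.finditer(f"(?={site})", seq)]


def double_digest(seq, enzyme_a, enzyme_b):
    """Return (start, end, fragment) for fragments bounded by one cut of each enzyme."""
    cuts = sorted(
        [(p, enzyme_a) for p in cut_positions(seq, ENZYMES[enzyme_a])]
        + [(p, enzyme_b) for p in cut_positions(seq, ENZYMES[enzyme_b])]
    )
    fragments = []
    # Adjacent cuts define fragments; keep those whose two ends differ in enzyme.
    for (start, left), (end, right) in zip(cuts, cuts[1:]):
        if left != right:
            fragments.append((start, end, seq[start:end]))
    return fragments


# Made-up sequence with one PstI site and one MseI site.
toy = "AAAA" + "CTGCAG" + "ACGT" * 10 + "TTAA" + "GGGG"
for start, end, fragment in double_digest(toy, "PstI", "MseI"):
    print(start, end, len(fragment), fragment)
```

On the toy sequence, the PstI-MseI pair yields one fragment, whereas the PstI-HpaII pair yields none because the sequence has no HpaII site. This is the sense in which different enzyme pairs sample different subsets of a genome.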
Sequences can be downloaded as filtered fastA files. Table 1 shows the filtered fastA file identification numbers for six technical replicates of each complexity-reduction method for the certified reference of Escherichia coli O157 (EDL 933). Table 2 lists the identification name of each bacterial isolate along with the filtered fastA file identification numbers for all 165 bacterial isolates for the three combinations of restriction enzymes used.

Value of the data
The data obtained using DArTseq™ complexity-reduced genotyping by sequencing, using three combinations of restriction enzymes, will be useful for comparison of complexity-reduction methods.
The datasets of short-read sequences from bacterial isolates provide insight into the resolution achieved by complexity-reduced genotyping by sequencing.
The data will be useful for future studies of complexity-reduced genotyping data from bacterial isolates, as they contain short reads of the certified reference of Escherichia coli O157 (EDL 933).

Production of data files
Raw data obtained from the sequencer in the form of fastQ files were demultiplexed using the DArTseq™ primary data processing pipeline, which produced one fastQ file for each sample assayed. Reads were filtered on Phred score [4] in two steps, as described in Georges et al. [3]. The barcode was removed from the reads, leaving fragments of 69 bp. The fastQ files were condensed into fastQcol files, which contained each unique sequence present in the original fastQ file, along with its read count and the mean quality score at each base [5]. Unique sequences were assigned SeqIndex identifiers. Adapters were trimmed, leaving fragments of up to 69 bp. Fragments shorter than 30 bp were removed. The data files consist of fastA files produced by the method described above.
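The condensing step described above (one entry per unique sequence, with a read count and a mean quality score at each base) can be sketched as follows. This is an illustration under stated assumptions, not the DArTseq™ pipeline: the layout of fastQcol files is not specified here, quality strings are assumed to use the Phred+33 encoding, and the function names and toy reads are hypothetical.

```python
from collections import defaultdict


def phred_scores(quality_string, offset=33):
    """Decode a fastQ quality string into Phred scores (Phred+33 is assumed)."""
    return [ord(char) - offset for char in quality_string]


def collapse_reads(records, min_len=30):
    """Collapse (sequence, quality_string) pairs into unique sequences.

    Returns {sequence: (read_count, mean_quality_per_base)} and drops
    reads shorter than min_len, mirroring the 30 bp cut-off in the text.
    """
    groups = defaultdict(list)
    for seq, qual in records:
        if len(seq) < min_len:
            continue
        groups[seq].append(phred_scores(qual))
    collapsed = {}
    for seq, score_lists in groups.items():
        count = len(score_lists)
        # zip(*...) walks the reads column by column, i.e. base by base.
        mean_quality = [sum(column) / count for column in zip(*score_lists)]
        collapsed[seq] = (count, mean_quality)
    return collapsed


# Made-up reads: two identical 32 bp reads, one 35 bp read, one 8 bp read.
records = [
    ("ACGT" * 8, "I" * 32),
    ("ACGT" * 8, "5" * 32),
    ("TTGCA" * 7, "I" * 35),
    ("ACGTACGT", "I" * 8),
]
collapsed = collapse_reads(records)
for seq, (count, mean_quality) in collapsed.items():
    print(seq, count, mean_quality[0])
```

Identical reads are merged into a single entry, which keeps the per-sample files compact while retaining the read depth and per-base quality needed for later filtering.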

Funding
This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.