Dataset of 16S rRNA and alkB genes in hydrocarbon polluted soils of Kuwait as revealed by Pyrosequencing

The data in this article was generated by high throughput sequencing of moderately hydrocarbon polluted sites (S1 and S2) and a heavily polluted site (S3) in Kuwait. Deoxyribonucleic acid (DNA) extracted from each site was subjected to polymerase chain reaction (PCR) amplification employing conserved primers of 16S rRNA and alkB genes. Unique Molecular Identifiers (MID) tags were added to individual samples prior to pooling and sequencing on a Roche GS FLX platform using Pyrosequencing Titanium Chemistry. Raw sff files were deposited to the public repository of National Centre for Biotechnology Information (NCBI) under accession no PRJNA816075. The sff files were clipped according to the MID tags and converted to fasta format. 16S rRNA gene sequences were aligned against the SILVA database. The predominant genera at S1 and S2 was Alkanindiges whereas Alcanivorax, was highly abundant at S3. Alkanindiges have been found to play a key role in hydrocarbon degradation and Alcanivorax genus is known for its hydrocarbon degrading capability. The alk B gene sequences were subjected to blastx. The diversity of alkB gene was higher in S3 as compared to S1 and S2. These findings may open the way to the use of the genera Alkanindiges and Alcanivorax in the rehabilitation of hydrocarbon-contaminated sites in hot, arid climates. The isolation of these microorganisms and the design of bioaugmentation procedures specific to the dry climate could be a key step towards the restoration of hydrocarbon contaminated soils.


a b s t r a c t
The data in this article was generated by high throughput sequencing of moderately hydrocarbon polluted sites (S1 and S2) and a heavily polluted site (S3) in Kuwait. Deoxyribonucleic acid (DNA) extracted from each site was subjected to polymerase chain reaction (PCR) amplification employing conserved primers of 16S rRNA and alkB genes. Unique Molecular Identifiers (MID) tags were added to individual samples prior to pooling and sequencing on a Roche GS FLX platform using Pyrosequencing Titanium Chemistry. Raw sff files were deposited to the public repository of National Centre for Biotechnology Information (NCBI) under accession no PRJNA816075. The sff files were clipped according to the MID tags and converted to fasta format. 16S rRNA gene sequences were aligned against the SILVA database. The predominant genera at S1 and S2 was Alkanindiges whereas Alcanivorax, was highly abundant at S3. Alkanindiges have been found to play a key role in hydrocarbon degradation and Alcanivorax genus is known for its hydrocarbon degrading capability. The alk B gene sequences were subjected to blastx. The diversity of alkB gene was higher in S3 as compared to S1 and S2. These findings may open the way to the use of the genera Alkanindiges and Alcanivorax in the rehabilitation of hydrocarbon-contaminated sites in hot, arid climates. The isolation of these microorganisms and the design of bioaugmentation procedures specific to the dry climate could be a key step towards the restoration of hydrocarbon contaminated soils.
© 2022 The Author(s

Value of the Data
• Data are useful for the bioremediation of hydrocarbon polluted soils generated by oil producing companies. • This data will also be interesting for environmentalists with concern on soil pollution by oil.
• The complete dataset can be used to design a series of quantitative polymerase chain (qPCR) reactions assays than can be used as monitoring tool for evaluating the effectiveness of bioremediation strategies. • The hydrocarbon degrading potential of the genera identified in the present dataset can be exploited further.

Data Description
The hydrocarbon pollution is a challenge in oil producing countries like Kuwait and the topsoil is the most exposed area to the oil spills. The influence of the top-soil contamination is more pronounced on the terrestrial life forms. The aim of this dataset was to determine whether the bacterial population diversity of the hydrocarbon polluted top-soil in the collected soil samples ( n = 3) was reflected in the diversity of a key functional gene crucial to the biodegradation of hydrocarbons present in soils. The soils were collected from a hydrocarbon-polluted site. The samples S1 and S2 originate from oil-contaminated soil located at (28 °58 35.6412" N; 48 °09 38.9099" E and the S3 from 28 °58 43.4759" N; 48 °10 07.8186" E . The samples were collected at 0-20 cm depth from the surface by using a sterilized stainless steel shovel as per the standard method reported elsewhere [1] . The samples were passed through a 1-mm sieve before the TPHs concentration was determined and the DNA was extracted for sequencing. The TPH  Fig. 1 . Samples S1 and S2 had almost similar taxonomic profile whereas S3 exhibited different taxonomies. The predominant bacterial groups present in sample 3 were mostly halophiles (salt tolerant). At S1 and S2, Alkanindiges topped the list whereas Alcanivorax was highly abundant at S3 ( Fig. 1 ). These genera were recognized as hydrocarbon degrading bacterium. A total of 30491 individual alkB sequences were obtained from the 3 samples. 7915 were obtained from sample 1, 14906 from sample 2 and 7670 from sample 3. A total of 2403 groups were obtained for sample 1 with only 74 groups containing 10 or more sequences. The equivalent figures for sample 2 are 4246 total groups and 148 with 10 or more sequences and for sample 3 are 4076 total groups with 74 containing 10 or more sequences. The majority of the groups from all three samples contained less than 5 sequences with a significant number containing only one sequence. The top 10 sequences obtained from each sample were inserted into the DSGene software package and a ClustalW alignment was performed using the default parameter settings. Clustering of alkB gene based on their abundances distributed the samples into two groups namely A (S1 and S2) and B (S3) ( Fig. 2 ).
The Shannon's diversity index is a measure of diversity that combines richness and the relative abundances. This index for the first 51 16S rRNA and alkB groups is shown in Table 1 . The population diversity of sample 3, based on the 16S rRNA, is much less than that calculated for samples 1 and 2. Samples 1 and 2 are equally diverse based on the 16S rRNA data. Conversely, sample 3 is more diverse than samples 1 and 2 based on the alkB data set.
HTS methodologies targeting both a bacterial taxonomy gene (16S rRNA) and a functional gene (alkB) demonstrated significant differences between the samples provided. Samples 1 and 2 were similar to each other but different from sample 3 when compared at the taxonomic level (16S rRNA) or functional gene (alkB) gene levels. These significant differences may be due to the level of hydrocarbon contamination in the soil. At the population (taxonomic) level sample 3 was much less diverse than samples 1 and 2, which were similar. This may indicate that the salt environment constrains the development of a diverse bacterial population.

DNA Isolation and High Throughput Sequencing
DNA was extracted from the soils using the MO BIO PowerSoil DNA Isolation Kit (MO BIO Laboratories, West Carlsbad, CA, USA). In total 250mg of soil sample was used for DNA isolation as per the manufacturer's instructions. The concentration and purity of the extracted DNA was determined using UV spectrophotometry (A260/A280 ratios) Conserved primers (16S rRNA forward primer: CAGCMGCCGCGGTAATWC 16S rRNA reverse primer: CCGTCAATTCMTTTRAGTTT; alkB forward primer: AAYACNGCNCAYGARCTNGGNCAYAA, alkB reverse primer: GCRTGRTGRTCN-GARTGNCGYTG) were used to amplify a region of 400 and 550 bp regions of the 16S rRNA and alkB genes respectively [ 2 , 3 ]. In brief, 10 pg of DNA, 1 × Taq buffer, 200 μM of deoxynucleoside triphosphate (DNTPs), 25 pMol of primers, and 1.25U of Taq polymerase were added to make a reaction volume of 50 μL. PCR was run for 21 cycles of denaturation at 94 °C (1 min), annealing 45 °C (1 min), and extension 72 °C (2 min). PCR amplification was visualized on an Agilent 2100 Bioanalyzer [4] . The PCR amplicons were tagged with unique MID sequences and pooled for clonal amplification by emulsion-based PCR and then sequenced using Pyrosequencing Titanium chemistry on a Roche GS FLX platform.

Bioinformatics Analysis
The raw .sff files checked for quality scores and processed for removal of MID tags. Clean sequences were submitted to cdhit-v4.5.4-2011-05-12 [5] . Each tag pool was clustered separately using a sequence identity threshold of 98.00%. The bidirectional sequencing reads of 16S amplicon samples were clustered (98% identity). The representative reads were then aligned to the SILVA database [6] . All Cluster-Representatives of alkB amplicon samples were blasted against a selected subset of alkB genes extracted from NCBI home page (alkB_ncbi_genes.fasta, 7972 sequences). The top 10 sequences obtained from each sample were inserted into the DSGene software package [7] and a ClustalW alignment performed using the default parameter settings for the alkB gene [8] . A phylogenetic tree was created using the Neighbour Joining Tree Building Method and Tajima-Nei Distance default settings [7] . The Shannon Diversity index (H) was calculated for the first 51 16S rRNA and alkB groups in each of the three samples [9] .

Ethics Statements
Not applicable.

Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.