Metagenomic data on bacterial diversity profiling of high-microbial-abundance tropical marine sponges Aaptos aaptos and Xestospongia muta from waters off Terengganu, South China Sea

Marine sponges are acknowledged as bacterial hotspots and as a resource of novel natural products or genetic material with industrial or commercial potential. However, sponge-associated bacteria are difficult to cultivate, and their desirable metabolites are produced at inadequate rates and quantities; meanwhile, bioinformatics and metagenomics tools are steadily progressing. Bacterial diversity profiles of the high-microbial-abundance wild tropical marine sponges Aaptos aaptos and Xestospongia muta were obtained by sample collection at Pulau Bidong and Pulau Redang islands, 16S rRNA amplicon sequencing on the Illumina HiSeq2500 platform (250 bp paired-end), and metagenomic analysis using the Ribosomal Database Project (RDP) classifier. Raw sequencing data in fastq format and relative abundance histograms of the 10 dominant species are available in the public repository Discover Mendeley Data (http://dx.doi.org/10.17632/zrcks5s8xp). Filtered, chimera-free sequencing data for the operational taxonomic units (OTUs) are available from NCBI under accession numbers MT464469 to MT465036.



Data format
Raw, filtered, analysed

Parameters for data collection
The A260/A280 DNA purity ratio of the extracted metagenome was between 1.7 and 2.0. The 16S rRNA amplicon sequencing data were quality controlled by 1) filtering raw tags with the Quantitative Insights into Microbial Ecology (QIIME) software, and 2) removing chimera sequences with the UCHIME algorithm and the ChimeraSlayer reference database. Sequences that showed ≥ 97% similarity were assigned to the same operational taxonomic unit (OTU).

Description of data collection
Approximately 5 cm³ of sponge tissue was sampled with a scalpel from visually healthy sponge hosts, A. aaptos and X. muta, placed in a new sterile plastic container, kept chilled in an ice box during transport, and then stored in a freezer at −20 °C.

Value of the data
These data show the first bacterial community profiles of the high-microbial-abundance tropical marine sponges Aaptos aaptos and Xestospongia muta from Pulau Bidong and Pulau Redang islands. These data help address the "supply problem" in the initial steps of exploiting bacterial natural products and metabolites, such as the difficulty of cultivating sponge-associated bacteria in sufficient volume and at a sufficient production rate for the synthesis of bacteria-derived pigments, drugs, biomaterials, or other bioactive compounds. Bacterial community data from marine sponges, also known as bacterial hotspots, give an approximate view of, and a reference for, an enormous gene pool and diversity for future mining of commercially or industrially significant bacteria. These data can be used in further experiments to provide insights into targeted genes of interest available in the bacterial metagenomes of A. aaptos and X. muta from Pulau Bidong and Pulau Redang. These data are useful for comparing bacterial communities in A. aaptos, X. muta, or other sponges from similar tropical regions or from different geographical regions. Raw data can be used for additional or integrated bioinformatics processing with other marine sponge-associated bacterial community profile data of ecological, economic, and scientific importance.

Data Description
The metagenomic datasets presented in this article provide detailed information on the microbial community compositions in two species of wild tropical marine sponges, A. aaptos and X. muta, from Pulau Bidong and Pulau Redang. In all figures and supplementary material, the sponge-associated bacterial communities in A. aaptos from Pulau Bidong, A. aaptos from Pulau Redang, and X. muta from Pulau Bidong are denoted A, B, and M, respectively.
The biodiversity (α-diversity) of the three bacterial communities is shown in a rarefaction curve (Fig. 1). The abundance distribution of the 35 dominant genera among all three bacterial communities is displayed in species abundance heatmaps (

Experimental design, materials, and methods
Sponge tissue from A. aaptos and X. muta was collected from the coastal waters of Pulau Bidong in June 2016 at a depth of 15 m (Fig. 1). Additionally, sponge tissue from A. aaptos was collected from the coastal waters of Pulau Redang in August 2016 at a depth of 18 m. A biopsy of approximately 50 mm³ of tissue was taken with a scalpel from 3 proximal individuals at each site and placed in a sterile urine container used as the sample container. The sponge tissue was kept chilled in an ice box during transport, and then stored in a laboratory freezer at −20 °C until further use.
The sponge tissue was pre-treated to prepare it for metagenome extraction [1]. The tissue was gently shaken in distilled water using a pair of sterile forceps to remove externally attached bacteria. The outermost layer, or pinacoderm, of the sponge was removed with a sterile scalpel and forceps. Solid sponge tissue was minced into smaller pieces of approximately 2-3 mm³ with a sterile scalpel and then stored at −20 °C. The minced tissue was placed in a clean mortar filled with liquid nitrogen. Next, the frozen, submerged tissue was ground into powder until the liquid nitrogen had evaporated. The mortar was covered with a paper towel during grinding to keep tissue fragments inside the mortar. Grinding was performed in a fume hood to prevent contact with the aerosolised powder. Bacterial DNA was extracted using the MO BIO PowerSoil DNA Isolation Kit (Qiagen, Germany) according to the manufacturer's instructions. The extracted DNA replicates from the same sponge species and location were combined in a microcentrifuge tube. The DNA in the tube was stored at −20 °C until used in subsequent downstream applications.
The sponge metagenomes were subjected to 16S rRNA amplicon sequencing. Initially, the specific primers 341F and 806R were used to amplify the V3 and V4 regions of the 16S rRNA genes [2,3]. The polymerase chain reaction (PCR) amplifications were carried out with Phusion® High-Fidelity PCR Master Mix (New England Biolabs). Then, the PCR products were qualified and quantified by agarose gel electrophoresis, which was used to detect and view the nucleic acids.
Subsequently, the samples with intense bands between 400 bp and 450 bp were carried forward to the following steps. The mixed PCR products were purified using the Qiagen Gel Extraction Kit (Qiagen, Germany). Sequencing libraries were then constructed using the NEBNext® Ultra™ DNA Library Prep Kit for Illumina according to the manufacturer's recommendations, after which the index codes were added. At this point, library quality was assessed on the Bioanalyzer 2100 system (Agilent) and the Qubit® 2.0 Fluorometer (Thermo Fisher Scientific). Lastly, the library was sequenced on an Illumina HiSeq2500 platform in Rapid Mode PE-250, generating 250 bp paired-end reads.
The 16S rRNA amplicon sequencing data were analysed with the following workflow. First, paired-end reads were demultiplexed to samples according to their unique barcodes and then truncated by removing the primer sequence and barcode. Subsequently, raw tags were produced by merging the overlapping, opposing paired-end reads of identical DNA fragments with the Fast Length Adjustment of Short Reads software (FLASH Version 1.2.7) [4]. Next, for quality control, the raw tags were subjected to specific quality-filtering conditions with the Quantitative Insights into Microbial Ecology software (QIIME Version 1.7.0) to derive good-quality clean tags [5,6]. Then, chimera sequences were detected and removed by using the UCHIME algorithm to compare the tags against the ChimeraSlayer reference database, which yielded non-chimeric effective tags [7-9]. The effective tags were then subjected to sequence analysis with the Uparse software (Version 7.0.1001) [10]. Sequences that showed 97% or greater similarity were assigned to the same operational taxonomic unit (OTU).
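The 97% similarity rule at the end of this workflow can be illustrated with a minimal sketch. This is not the Uparse algorithm: it is a simplified greedy centroid clustering, and it assumes equal-length tags compared position by position rather than by alignment. The function names `identity` and `greedy_otu_clusters` are hypothetical and chosen only for illustration.

```python
def identity(a: str, b: str) -> float:
    """Fraction of matching positions between two equal-length sequences.

    Assumption: no alignment is performed, so both tags must have the same length.
    """
    if len(a) != len(b):
        raise ValueError("this toy identity only handles equal-length sequences")
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / len(a)


def greedy_otu_clusters(tags: list[str], threshold: float = 0.97) -> list[list[str]]:
    """Assign each tag to the first OTU whose centroid it matches at >= threshold.

    The first member of each cluster acts as the centroid; a tag that matches
    no existing centroid starts a new OTU.
    """
    clusters: list[list[str]] = []
    for tag in tags:
        for members in clusters:
            if identity(tag, members[0]) >= threshold:
                members.append(tag)
                break
        else:
            clusters.append([tag])
    return clusters
```

In this sketch, a 100 bp tag with two mismatches to a centroid (98% identity) joins that OTU, whereas a tag with five mismatches (95% identity) starts a new one.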
Then, the Ribosomal Database Project (RDP) classifier (Version 2.2) [11] was used to annotate and classify the representative sequence of every OTU based on the Greengenes database [12]. Thereafter, multiple sequence alignment was performed with the Multiple Sequence Comparison by Log-Expectation (MUSCLE) software (Version 3.8.31) to study the phylogenetic relationships among OTUs and the phylogenetic differences between the dominant species of multiple groups [13]. Next, OTU abundances were standardised to a common sequence number corresponding to the sample with the fewest sequences. The OTU annotations were visualised as OTU annotation trees constructed with the graphical phylogenetic analysis tool (GraPhlAn), circular evolutionary trees, relative abundance histograms, abundance heatmaps, and Krona charts [14].
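Standardising OTU abundances to the smallest sample is commonly done by random subsampling (rarefaction). The sketch below assumes that approach, which the text does not state explicitly. The function names are hypothetical; any read counts passed in are for the caller to supply.

```python
import random
from collections import Counter


def rarefy(counts: dict[str, int], depth: int, rng: random.Random) -> dict[str, int]:
    """Subsample `depth` reads without replacement from one sample's OTU counts."""
    # Expand the count table into one entry per read, then draw a random subset.
    pool = [otu for otu, n in counts.items() for _ in range(n)]
    if depth > len(pool):
        raise ValueError("depth exceeds the number of reads in the sample")
    return dict(Counter(rng.sample(pool, depth)))


def normalise_to_smallest(
    table: dict[str, dict[str, int]], seed: int = 0
) -> dict[str, dict[str, int]]:
    """Rarefy every sample to the read count of the smallest sample."""
    depth = min(sum(counts.values()) for counts in table.values())
    rng = random.Random(seed)  # fixed seed so the subsample is reproducible
    return {name: rarefy(counts, depth, rng) for name, counts in table.items()}
```

After this step every sample has the same total number of reads, so richness and diversity values are not driven by differences in sequencing depth.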
The intra-community diversity, also known as α-diversity, was analysed for the complexity of local species diversity through 6 indices, namely Observed-species, Chao1, Shannon, Simpson, ACE, and Good's coverage. The indices of the samples were calculated with the QIIME software (Version 1.7.0) and then displayed using the R software (Version 2.15.3) [15]. The Chao1 and ACE indices were used to estimate the richness of the microbial communities, whereas the Shannon and Simpson indices were used to estimate their diversity. Another index, Good's coverage, was applied to characterise the sequencing depth. The intra-community diversity was graphically represented in a Venn diagram, rarefaction curve, and rank abundance curve.
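As an illustration of what these indices measure, the sketch below computes five of them from a vector of OTU counts. It uses one common form of each formula (bias-corrected Chao1, Shannon with base-2 logarithm, Simpson as one minus the sum of squared proportions). The exact variants implemented in QIIME 1.7.0 are an assumption here, and ACE is omitted for brevity. The function name `alpha_diversity` is hypothetical.

```python
import math


def alpha_diversity(counts: list[int]) -> dict[str, float]:
    """Compute several alpha-diversity indices for one sample's OTU counts."""
    counts = [c for c in counts if c > 0]
    if not counts:
        raise ValueError("sample contains no reads")
    total = sum(counts)
    observed = len(counts)
    singletons = sum(1 for c in counts if c == 1)  # OTUs seen exactly once
    doubletons = sum(1 for c in counts if c == 2)  # OTUs seen exactly twice
    proportions = [c / total for c in counts]
    return {
        # Richness: number of OTUs actually observed.
        "observed_species": float(observed),
        # Richness estimate that adds unseen OTUs inferred from rare ones.
        "chao1": observed + singletons * (singletons - 1) / (2 * (doubletons + 1)),
        # Diversity: entropy of the OTU proportions, in bits.
        "shannon": -sum(p * math.log2(p) for p in proportions),
        # Diversity: probability that two random reads belong to different OTUs.
        "simpson": 1.0 - sum(p * p for p in proportions),
        # Sequencing depth: share of reads that are not singletons.
        "goods_coverage": 1.0 - singletons / total,
    }
```

For a perfectly even community of four OTUs with five reads each, this gives a Shannon index of 2 bits, a Simpson index of 0.75, and a Good's coverage of 1, since no OTU is a singleton.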
The inter-community diversity, also known as β-diversity, was assessed to evaluate differences in species complexity between samples. It was determined with the QIIME software (Version 1.7.0) using both weighted and unweighted UniFrac. Cluster analysis was preceded by principal component analysis (PCA), which was applied to reduce the dimension of the original variables using the FactoMineR and ggplot2 packages in the R software (Version 2.15.3). Principal Coordinate Analysis (PCoA) was performed to obtain principal coordinates and visualise complex, multidimensional data. The previously obtained weighted or unweighted UniFrac distance matrix among samples was transformed to a new set of orthogonal axes, in which the first principal coordinate captures the largest share of the variation, the second principal coordinate the second largest, and so on. The PCoA results were displayed with the weighted correlation network analysis (WGCNA), stats, and ggplot2 packages in the R software (Version 2.15.3). Unweighted Pair-group Method with Arithmetic Means (UPGMA) clustering, a hierarchical clustering method, was performed with the QIIME software (Version 1.7.0) to interpret the distance matrix using average linkage.
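The transformation of a distance matrix into principal coordinates can be sketched as follows. This is a minimal pure-Python illustration, not the R workflow used above: it double-centres the squared distances (Gower centring) and extracts only the first principal coordinate by power iteration. It assumes the leading eigenvalue is positive, which holds when the distances can be embedded in Euclidean space. In practice a full eigendecomposition from a numerical library would be used. The function names are hypothetical.

```python
import math


def double_center(dist: list[list[float]]) -> list[list[float]]:
    """Gower-centre the matrix of -0.5 * squared distances."""
    n = len(dist)
    sq = [[-0.5 * d * d for d in row] for row in dist]
    row_mean = [sum(row) / n for row in sq]
    col_mean = [sum(sq[i][j] for i in range(n)) / n for j in range(n)]
    grand = sum(row_mean) / n
    return [
        [sq[i][j] - row_mean[i] - col_mean[j] + grand for j in range(n)]
        for i in range(n)
    ]


def first_principal_coordinate(dist: list[list[float]], iterations: int = 200) -> list[float]:
    """Return sample scores on the first principal coordinate via power iteration."""
    b = double_center(dist)
    n = len(b)
    vec = [float(i + 1) for i in range(n)]  # arbitrary non-constant start vector
    eigenvalue = 0.0
    for _ in range(iterations):
        nxt = [sum(b[i][j] * vec[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0.0:
            break
        vec = [x / norm for x in nxt]
        eigenvalue = norm  # converges to the magnitude of the leading eigenvalue
    # Scale the unit eigenvector by sqrt(eigenvalue) to obtain coordinates.
    return [math.sqrt(eigenvalue) * x for x in vec]
```

When the samples lie on a single gradient, the first coordinate alone reproduces every pairwise distance; further axes would be obtained from the remaining eigenvectors.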
The relative dissimilarities or relatedness between marine sponge-associated microbial communities were calculated with distance metrics that incorporate phylogenetic distances, namely the weighted and unweighted UniFrac distances that are common in microbial ecology. The weighted UniFrac distance compares microbial communities pairwise based on the abundance of the observed organisms, while the unweighted UniFrac distance compares them based on the presence or absence of organisms [16]. The distance matrix data were visualised with PCoA, PCA, and UPGMA. UPGMA is a simple agglomerative hierarchical clustering method commonly used in ecology. The UPGMA cluster trees constructed by this method allow the similarity among samples to be compared, with the most similar samples clustered together.
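The difference between the two UniFrac variants can be made concrete with a small sketch. It assumes a simplified tree representation in which each branch is given as its length plus the set of OTUs beneath it, and it computes the raw (non-normalised) weighted distance. The function names and the branch representation are hypothetical and do not mirror the QIIME implementation.

```python
Branch = tuple[float, frozenset[str]]  # (branch length, OTUs descending from it)


def unweighted_unifrac(
    branches: list[Branch], sample_a: dict[str, int], sample_b: dict[str, int]
) -> float:
    """Fraction of observed branch length that leads to only one of the samples."""
    shared = unique = 0.0
    for length, leaves in branches:
        in_a = any(sample_a.get(otu, 0) > 0 for otu in leaves)
        in_b = any(sample_b.get(otu, 0) > 0 for otu in leaves)
        if in_a and in_b:
            shared += length
        elif in_a or in_b:
            unique += length
    observed = shared + unique
    return unique / observed if observed else 0.0


def weighted_unifrac(
    branches: list[Branch], sample_a: dict[str, int], sample_b: dict[str, int]
) -> float:
    """Branch lengths weighted by the difference in relative abundance below them."""
    total_a = sum(sample_a.values())
    total_b = sum(sample_b.values())
    distance = 0.0
    for length, leaves in branches:
        frac_a = sum(sample_a.get(otu, 0) for otu in leaves) / total_a
        frac_b = sum(sample_b.get(otu, 0) for otu in leaves) / total_b
        distance += length * abs(frac_a - frac_b)
    return distance
```

Two samples that contain the same OTUs in very different proportions have an unweighted distance of zero but a clearly non-zero weighted distance, which is why the two metrics can cluster the same samples differently.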
The rarefaction curve, species abundance heatmaps, and evolutionary tree are presented in this article (Figs. 1-7). Raw sequencing data and other analysed data were published in the public repository Discover Mendeley Data (http://dx.doi.org/10.17632/zrcks5s8xp). Filtered sequencing data with chimeras removed were published in GenBank under accession numbers MT464469 to MT465036.

Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships which have, or could be perceived to have, influenced the work reported in this article.

Ethics statement
All experiments comply with the ARRIVE guidelines and were carried out in accordance with the National Institutes of Health guide for the care and use of Laboratory animals (NIH Publications No. 8023, revised 1978).