Diversity of aromatic hydroxylating dioxygenase genes in mangrove microbiome and their biogeographic patterns across global sites

Abstract Aromatic hydrocarbons (AH), such as polycyclic aromatic hydrocarbons, are compounds largely found in nature. Aromatic‐ring‐hydroxylating dioxygenases (ARHD) are proteins involved in AH degradation pathways. We used ARHD functional genes from an oil‐impacted mangrove area and compared their diversity with other sites around the world to understand the ARHD biogeographic distribution patterns. For this, a comprehensive database was established with 166 operational protein families (OPFs) from 1,758 gene sequences obtained from 15 different sites worldwide, of which twelve are already published studies and three are unpublished. Based on a deduced ARHD peptide sequences consensus phylogeny, we examined trends and divergences in the sequence phylogenetic clustering from the different sites. The taxonomic affiliation of the OPF revealed that Pseudomonas, Streptomyces, Variovorax, Bordetella and Rhodococcus were the five most abundant genera, considering all sites. The functional diversity analysis showed the enzymatic prevalence of benzene 1,2‐dioxygenase, 3‐phenylpropionate dioxygenase and naphthalene 1,2‐dioxygenase, in addition to 10.98% of undefined category ARHDs. The ARHD gene correlation analysis among different sites was essentially important to gain insights on spatial distribution patterns, genetic congruence and ecological coherence of the bacterial groups found. This work revealed the genetic potential from the mangrove sediment for AH biodegradation and a considerable evolutionary proximity among the dioxygenase OPFs found in Antarctica and South America sites, in addition to high level of endemism in each continental region.

The description of the structural and functional diversity of ARHD genes has been important in the establishment of genetic markers for contaminated environments followed by new insights for bioremediation processes (Gutierrez et al., 2013;Liang et al., 2011;Tan et al., 2015).
Molecular methods have been employed in the studies of microbial biogeography to evaluate taxonomic diversity. Microbial taxa are defined by genetic variation of any genomic locus grouped as operational taxonomic units (OTUs) (Koeppel & Wu, 2013). In recent decades, the development of cultivation-independent techniques associated with next-generation sequencing (NGS) have stimulated the investigation of the microbial diversity involved in the PAHs degradation (Flocco, Gomes, Mac Cormack, & Smalla, 2009;Gomes et al., 2007;Jurelevicius et al., 2012;Ma et al., 2015;Wu et al., 2014). In addition, biogeography studies has helped to elucidate the evolutionary relationship (governed by selection, drift, dispersal, and mutation processes) of living creatures through genetic proximity between samples from different environments, and thus understand their distribution patterns in space and over time (Hanson, Fuhrman, Horner-Devine, & Martiny, 2012). Therefore, the microbial diversity can be evaluated under seasonality effect, unusual natural changes or anthropogenic impact in the ecosystems (Lynch & Neufeld, 2015).
The scarcity of studies on evolutionary relationships and distribution patterns of ARHD genes around the world motivated the development of this work. Since there is a reasonable number of published studies that generated amplicons of ARHD genes, we decided to compile all of this information to try to understand the genetic relatedness among these sequences and their geographic distribution across the globe. For this purpose, sediment samples from an oil-impacted mangrove site was analyzed through the construction and sequencing of one α-ARHD gene library. The diversity of ARHD genes involved in hydrocarbon biodegradation pathway in the mangrove sediment was determined and compared to ARHD gene sequences derived from other studies, in an attempt to elucidate patterns of biogeographic distribution of such enzymes at a global scale.

| Sampling
Sediment samples were collected in a mangrove area affected by an oil spill in the 1980s, located in the Bertioga city, São Paulo State (23º53′49′' S, 46º12′28′'W) (Andreote et al., 2012). Samples were collected perpendicularly in the mangrove transect (approximately 300 m in total), from three sites separated from each other by at least 30 m as described previously (Andreote et al., 2012). Sampling was performed in triplicate using sterile 50 ml-polypropylene tubes, which were completely filled with sediment to prevent the entry of air, cooled, and transported to the laboratory for further processing.

| Community DNA extraction
The metagenomic DNA from 0.3 g of oil-impacted mangrove sediment was extracted in triplicate with Isolation PowerSoil ® kit (Mobio Labs, Inc. Solana Beach, USA), according to the manufacturer's instructions.
DNA quantity and quality were determined by agarose gel electrophoresis (1% w/v) and Nanodrop (Thermo Scientific). After the evaluation, DNA samples were stored at −20°C for subsequent analyzes.

| PCR amplification of the ARHD genes
The mangrove metagenomic DNA was used as template for ARHD gene PCR-based amplification using the degenerate primers: ARHDf (TTY RYI TGY AII TAY CAY GGI TGG G) and ARHDr (AAI TKY TCI   GCI GSI RMY TTC CA) (Bellicanta, 2005). These primers were designed to flank a highly conserved region of the alpha subunit from ARHD gene, with an expected amplicon size ranging between 300 and 329 bp. The PCR reaction was prepared to a final volume of 50 μl containing 1X Taq buffer (Invitrogen), 1.5 mmol/L MgCl 2 , 0.2 mmol/L dNTP mix (GE Healthcare), 1.2 μmol/L of each primer, 1 U Platinum Taq DNA polymerase (Invitrogen) and 2 μl (12 ng/μl) of eDNA. The reaction was conducted in an Eppendorf Mastercycler Gradient (Eppendorf Scientific, New York, USA) and the program consisted of an initial denaturation at 97°C for 3 min, followed by 35 cycles of denaturation at 94°C for 1 min, annealing at 57°C for 1 min and extension at 72°C for 1 min, and a final extension step at 72°C for 5 min.
PCR products were separated on 1% agarose gel (Fisher Scientific, MA) in 1X TAE buffer, using the molecular weight marker 1 Kb Plus DNA Ladder (Invitrogen) for size estimation and observed under UV light using a ImageQuant LAS 4000 system (GE Healthcare Life Sciences). | 3 of 13 de SOUSA et Al.

| Construction of the ARHD gene library
PCR products were excised from the gel and subsequently purified using Illustra GFX PCR DNA Purification Kit (GE Healthcare Life Sciences). The amplicons were inserted into the cloning vector pTZ57R/T and transformed into Escherichia coli JM 109 cells using InsTAclone PCR Cloning Kit (Thermo Scientific), according to the manufacturer's protocols. Transformed cells were plated onto LB agar containing 50 μg/ml of ampicillin, 40 μl of 5-bromo-4-chloro-3-indo lylβ-D-galactopyranoside [X-Gal (20 mg/ml)] and 40 μl of isopropyl β-D-thiogalactopiranoside [IPTG (100 mmol/L)], and incubated at 37°C for 18 hr. Positive (white) colonies were transferred to a new LB agar plate containing the same concentrations of ampicillin, X-Gal and IPTG, and incubated for 20 hr at 37°C in order to confirm the success of the ligation/transformation procedure. White colonies were transferred to 96-deep well plates with 1 ml of LB broth medium, following incubation at 37°C for 18 hr. After that, aliquots of cell growth were transferred to 96 well-ELISA plate and added of 10% glycerol for storage at −80°C.

| Sequencing of the ARHD gene library
PCR amplification of the ARHD genes from the plasmid clones was performed in 96-well PCR plates in a final volume of 25 μl using the primers M13F (5′ CGC CAG GGT TTT CCC AGT CAC GAC 3′) and M13R (5′ TTT CAC ACA GGA AAC AGC TAT GAC 3′). PCR reaction contained 1X Taq buffer (Invitrogen), 1.5 mmol/L MgCl 2 , 0.2 mmol/L dNTP mix (GE Healthcare), 0.4 μmol/L of each primer, 2 U Platinum Taq DNA polymerase (Invitrogen) and 1 μl (10 ng/μl) of plasmid DNA. The reaction was conducted in an Eppendorf Mastercycler Gradient (Eppendorf Scientific, New York, USA) and the amplification program consisted of an initial denaturation at 94°C for 3 min, followed by 30 cycles of denaturation at 94°C for 30 s, annealing at 60°C for 20 s and extension at 72°C for 90 s, and a final extension step at 72°C for 5 min. The PCR products were purified using the Illustra GFX 96 PCR Purification Kit (GE Healthcare Life Sciences) and checked on 1% agarose gel electrophoresis (Fisher Scientific, MA). The purified PCR products were then used as template for sequencing reaction using Big Dye kit (Life Technologies) and the primers M13R, according to manufacturer's guidelines. The sequencing of ARHD gene library was performed using the ABI 3500 XL platform (Applied Biosystems ® 3500 XL).

| Bioinformatic analysis of ARHD gene sequences
ARHD gene sequences obtained were up-loaded in Phred software (Ewing & Green, 1998), for the evaluation of the quality of the base calls and sequences with score <20 were removed. The filtered sequences were edited using the Bioedit software (Hall, 1999) to remove the remaining primer and vector sequences. After treatment, the sequences were classified using Blastx tool (Altschul et al., 1990) from National Center for Biotechnology Information (NCBI) against "Reference Protein" database (Refseq_protein). The best hits (returned sequence with best identity) were considered as reference for taxonomic classification.

| Phylogenetic analysis
Phylogenetic analyses were performed using the following sequence datasets: (1) ARHD gene sequences generated in this study, and (2) Homologous sequences of public domain obtained from other environments around the world. The following public sequence datasets were used: Pearl River in China (Wu et al., 2014); oil reservoir in Brazil ; rhizosphere and bulk soil samples from Antarctica (Flocco et al., 2009) Antarctic ice (Kuhn & Pellizari, unpublished); microbial mats from France (Bordenave et al., 2008); coastal marine sediments of Patagonia (Lozada et al., 2008); biphenyl-contaminated river sediment from United States (Sul et al., 2009); anthropogenic dark earth from Brazilian Amazon (Germano et al., 2012); Pine root zone from Czech Republic (Leigh et al., 2007); soils from King George Bay, Antarctic (Jurelevicius et al., 2012); PAHpolluted soil from France (Cébron et al., 2011); and soils and sediments from Antarctica (Muangchinda et al., 2015). The main goal of this step was to conduct a biogeographical analysis using a conserved region of the ARHD gene as marker. Sequence alignment was performed using the ClustalW tool (Larkin et al., 2007) in Bioedit v.7.2.5. The aligned sequences were analyzed using Mothur v.1.33.3 (Schloss et al., 2009) to determine the OPFs (Operational Protein Families) at the evolutionary distances (D) 0.03, 0.05, 0.10 and 0.20. The EMBOSS Transeq software was used to convert the OPF DNA sequences to amino acid sequences in the correct reading frame. The best OPF representatives (D = 0.05) were submitted in MEGA v5.0 program (Tamura, Dudley, Nei, & Kumar, 2007) for the construction of the phylogenetic trees.
The phylogenetic trees were constructed using Neighbor-joining method and Kimura 2-parameter substitution model (Kimura, 1980), with bootstrap values calculated from 1,000 replicate runs.

| Statistical analysis
ARHDs richness and diversity were analyzed in Mothur v.1.33.3 and were calculated based on the sequence datasets using Shannon and Simpson diversity indexes, and Chao1 and ACE richness estimators. In addition, rarefaction curves were constructed to assess the observed richness (D = 0.05). The distribution of ARHD OPFs (D = 0.05), considering the sequences from all sites and based on geographic origin, was compared and displayed in a Venn diagram (Fauth et al., 1996).
Principal Coordinates Analysis (PCoA) of Bray-Curtis distances was conducted in Past3 software version 3.14, and assessed the genetic correlation among the sites.

| A brief description of the analyzed sites
The biogeographic distribution patterns of the ARHD genes worldwide were obtained through the combined analysis of twelve datasets resulting from already published studies and two unpublished, in addition to the dataset from this study. Of all studies, one was carried out in Asia (River estuary, China), one in North America (River sediment, USA), three in South America (petroleum reservoirs, Brazil; coastal marine sediments, Argentina; dark earth of Amazonia, Brazil), three in Europe (lagoon, south-east France; pine root zone, Czech Republic; PAH-contaminated soil, France) and six in Antarctica (Figure 1) ( Table 1). All the sequences derived from these studies are available for public access in the Genbank database and were downloaded to proceed with the bioinformatics analyses performed in this study.

| Description of ARHD gene library
The gene library assembled in this study yielded 129 high quality inserts for ARHD genes (approximately 329 bp in length) from the 288 sequenced clones (44.8%). The functional classification was confirmed by BLASTx and revealed that all sequences were more closely related with the benzene 1,2-dioxygenase enzyme retrieved from Pseudomonas ( Figure 2). The average identity between the sequences was 77%, with values ranging from 45% to 94%. Therefore, the majority of OPFs represent new types of ARHD gene, emphasizing their singularity and ecological importance to the mangrove. The search for functional domains revealed the presence of the dioxygenase alpha subunit containing the "Rieske" conserved domain in all sequences of interest.

| Phylogenetic analysis of ARHD genes
Eleven OPFs were identified among the 129 gene sequences determined in this study, considering 0.05 dissimilarity level as cutoff.
The deduced ARHD peptide sequences were clustered into five distinct phylogenetic groups ( Figure 3). This phylogenetic arrangement showed an unknown catabolic diversity in the mangrove sediment represented by new putative gene sequences. The OPFs 01 e 02 were the most abundant ones, accounting for 78% and 15% of all analyzed clones, respectively. These two OPFs were grouped in different clusters, the OPF01 presented 89% sequence similarity with a benzene 1,2-dioxygenase of Pseudomonas, while the OPF02 showed 78% sequence similarity with the same peptide sequence.
OPF07 showed a greater relatedness to the OPF01, and they were both recovered into a monophyletic cluster containing two reference sequences obtained from Genbank, one from Pseudomonas sp. and other from Bordetella petrii. OPFs 02, 03, 04, 05, 06, 09, 10, and 11 formed distinct clusters more distantly related to the reference sequences, with special emphasis to the OPF10, which was recovered in a completely separate branch distantly related to all other groups.

| Biogeographic patterns of ARHD genes worldwide
Sequence analysis in the Mothur software allowed the identification of the OPF representatives for each site analyzed, totalizing 166 OPFs from 1,758 gene sequences (Table S1). The taxonomic assignment of all OPFs found, considering the five most abundant genera, revealed that 52% of all sequences matched to the genus Pseudomonas, 12% to Streptomyces, 5.7% to Variovorax, 5.4% to Bordetella and 4.8% F I G U R E 1 Sampling sites around the world 41.47% of all the sequences matched to benzene 1,2-dioxygenase, 29.95% to 3-phenylpropionate dioxygenase, 13.82% to naphthalene 1,2-dioxygenase, 10.98% to an undefined ARHD family, 3.87% to IPB-dioxygenase, 0.85% to phenoxybenzoate dioxygenase and 0.06% to catechol 2,3-dioxygenase ( Figure 4c). The sequences assigned to the benzene 1,2-dioxygenase were distributed in the sites SM, S2, S4, S5, S8, S9, and S10, representing all continents except Asia (Figure 2 and 4b). Despite being the second most abundant function in general, 3-phenylpropionate dioxygenase was distributed only in sites S1, S4, S5, and S9, representing Asia, Antarctica and South America (Figure 2).
The sequences assigned to the naphthalene 1,2-dioxygenase were distributed in more than half of the analyzed sites, representing the Asia (S1), South America (S7 and S9) Antarctica (S3, S11, S12, and S15) and Europe (S6 and S13) (Figure 2). The sequences related to the undefined ARHD Family were distributed in two sites in South America (S7 and S9), while the sequences assigned to the IPB-dioxygenase were distributed in sites located in Antarctica (S4 and S5) and Europe (S6) (Figure 2). Finally, the sequences corresponding to the phenoxybenzoate dioxygenase and catechol 2,3-dioxygenase were distributed only in one site in South America (S9) and Antarctica (S15), respectively ( Figure 2). The sites S4, S5 (Antarctica) and S9 (South America) showed greater functional diversity (Figure 2).

| ARHD gene diversity worldwide
Genetic diversity of the ARHD genes was analyzed at 0.02 and 0.05 distance (D) cutoff (Table 2). However, based on the clusters observed from an initial phylogenetic tree performed with all nucleotide sequences and BLASTP analysis to identify similarities, the cutoff value of 0.05 distance for OPF definition was chosen for subsequent analyses. At 5% distance, 93-100% coverage values were observed for the ARHD gene diversity in all sites examined, except for sites S10 (Czech Republic) and S13 (France), which showed coverage lower than 60% (Table 2). Shannon index calculations showed that ARHD diversity in the mangrove soil under study was relatively higher than in the sites S1 (China), S2 (Northeast, Brazil) and S12 (Antarctica), although lower than the other sites ( Table 2). The Simpson index measures the probability that two randomly sorted sequences within a sample will be of the same species (Simpson, 1949). The observed value for the Simpson index (62%) also demonstrated that the mangrove sediment (SM) has a relatively higher diversity (lower probability) than the sites S1, S2, and S12. The Chao 1 estimator, that measures the "species" (in this case, OPFs) richness based on the distribution of "singletons" and "doubletons" (Chao, 1984), revealed that the richness of ARHD genes in the mangrove sediment was higher than in all other sites, except for site S9 (Amazonia , Table 2). Indeed, almost 82% of OPFs determined in this study were represented by a single clone.
The rarefaction curve expressed the observed number (richness) of OPFs for each site as a function of the total sequences recovered from each site (Sanders, 1968  (Amazonia) ( Figure 5). For the sites S4, S5, S6, S7 and S11, the curve clearly tends to an asymptote and additional sampling will probably T A B L E 2 Diversity and richness indices of ARHD genes in the sites under study  Table 1 reveal only a few number of additional OPFs. Finally, for the sites S8, S10, S12, S13, and S15, the weakly curvilinear plots showed that OPFs richness is far from being complete and further sampling may reveal potential untapped diversity ( Figure 5).

The analysis of taxonomic diversity and abundance showed that
Pseudomonas was the most common genus related to the ARHDs among the sites examined ( Figure 6). The sites S4, S5, and S9 showed the highest richness of bacterial genera, suggesting that the higher diversity of dioxygenases observed in these sites ( OPFs showed endemism in Asia (site S1), South America (sites S2, S7, and S9), North America (site S8), Antarctica (sites S3, S4, S5, S11, S12, and S15) and Europe (sites S6, S10, and S13), respectively. In contrast, Europe shared one OPF with Antarctica and four OPFs with Asia, and South America shared six OPFs with Antarctica (Figure 7).

F I G U R E 6
Distribution of genera in each site based on the relative abundance of ARHD sequences annotated against the NCBI RefSeq database by BLAST F I G U R E 7 Venn diagram of shared OPFs among sites of different continental regions including Asia (site S1), North America (site S8), South America (sites S2, S7, and S9), Europe (sites S6, S10, and S13) and Antarctica (sites S3, S4, S5, S11, S12, and S15). The bar graph shows the number of OPFs found in each region. Detailed information on the sites is shown in Table 1. This analysis was run using the jvenn online platform (Bardou et al., 2014)

| Principal coordinates analysis of the ARHD genes
Principal coordinates analysis (PCoA) of Bray-Curtis distances revealed significant correlations among sites as a function of shared OPFs. The majority of the sites were closely grouped respecting their continental origin. The sites of South America (SM, S9 e S2), Antarctica (S4, S5, S11, S12, and S15) and Europe (S6 and S13) grouped together.
Although S3 and S7 sites are located in Antarctica and South America (Patagonia Coast), respectively, they showed larger discrepancy from the other sites with the same geographical location. The site S8, the only one located in North America, grouped together with the majority of the sites from South America and Antarctica. On the other hand, the site S1, the only one located in Asia, grouped together with the sites S6 and S13, both located in France (Figure 8).

| DISCUSSION
In this study, the biogeographic distribution patterns of ARHD functional genes obtained from 15 different sites worldwide was investigated, in addition to their diversity at the functional and taxonomic levels.
According to the functional assignment of the OPFs, benzene Head, Werner, and Davenport (2015) published a re-evaluation of dioxygenase gene-based phylogeny that showed, on one side, subclades with closely related genes derived from microorganisms that degrade the same substrate, and, on the other side, clades encompassing dioxygenase genes derived from microorganisms able to degrade different substrates. In the present study, lineages 1, 2, and 3 represented PAH-GN, T/B and PAH-GP dioxygenases, respectively, similarly grouped in previous articles (Gibson & Parales, 2000;Iwai et al., 2011), whereas the lineage 4 showed similarity with phenylpropionate/phthalate/aromatic dioxygenases and benzoate/aromatic dioxygenases. All enzymes showed correlation with the dioxygenase groups indicated by the studies cited above. Benzene 1,2-dioxygenase and IPB-dioxygenase were showed as the closest clades in the phylogenetic tree, as suggested by Gibson and Parales (2000) and Nam et al., 2001;. Naphthalene 1,2-dioxygenase, phenoxybenzoate dioxygenase, 3-phenylpropionate dioxygenase and catechol 2,3-dioxygenase represented the groups III, I (Nam et al., 2001), clade C into lineage 2 and clade D into lineage 3 respectively, according to previous classifications (Meynet et al., 2015).
The present work revealed interesting patterns of ARHD biogeographic distribution. Integrated phylogenetic analysis of deduced peptide sequences from OPFs of each site examined revealed two aspects: (1) site-specific genetic endemism for some of the ARHD families (e.g., some benzene 1,2-dioxygenases, IPB dioxygenases, phenoxybenzoate dioxygenases and some undefined α-ARHD families), especially in Antarctic and Amazonian sites; and (2)  were some OPFs that clustered together according to their biogeochemical origins, demonstrating ecological coherence (OPFs from sites S5, S4, S12, S15 and S11; and OPFs from sites S13 and S6). The concept of ecological coherence was well employed in the study of Oton et al., 2015, when they discovered a correlation between physicochemical properties and genetic diversity in soil samples.  (Cao, Hong, Li, & Gu, 2012;Oton et al., 2015).
The close genetic relatedness among different environments (such as S4, S5, S11, S12, and S15) could be explained by microbial dispersal across ocean currents. According to Hanson et al. (2012), most studies show that microorganisms show limited dispersal due to their low mobility, favoring its establishment in restricted areas. On the other hand, the movement of microorganisms across space may occur by passive dispersal (e.g., via wind and air) or active dispersal (e.g., via ride in larger organisms), facilitating gene flow among different geographic regions (Martiny et al., 2006;Nemergut et al., 2013). Smith et al. The phylogenetic tree showed some divergences in the distribution of the taxonomic classes inferred from the ARHD sequences.
These divergences may be related to two factors: (1) low percentage of genetic similarity with their closest bacterial hits in the database; or (2) horizontal gene transfer. The lack of accuracy and representativeness of the available databases may be one of the causes for the first factor. The second factor is a natural biological phenomenon driven by selective pressures and may lead to taxonomic divergences. The study conducted by Jacob Parnell et al. (2010) showed that microbial biogeography may be affected by horizontal gene transfer, resulting in an average similarity of functional genes higher than the taxonomic average similarity evaluated based on the 16S ribosomal gene (Jacob Parnell et al., 2010). It is already known that horizontal gene transfer raises doubts about the veracity of the likely taxonomic classification inferred from the genetic elements that move among bacterial and archaeal species. In this study, a more robust analysis would be necessary to evaluate the percentage of phylogenetic congruence between 16S rRNA and ARHD genes from several bacterial genera. Oton et al. (2015) reported 85% of phylogenetic congruence between 16S rRNA and amoA genes of Thaumarchaeota, demonstrating low horizontal transfer rate of these genes.
In conclusion, this study contributed to the knowledge on the richness, distribution, and genetic relatedness of the ARHD genes recovered from fifteen different sites, as well as the taxonomic groups likely associated with such functional genes. A moderate genetic congruence was evidenced among ARHDs worldwide and, in some cases, biogeographic endemism patterns were revealed, presumably modeled by biotic and abiotic environmental factors. Results suggest that there was a connectivity between South America and Antarctica sites, probably caused by sea currents, air masses or host organisms. This could have facilitated the dispersion of microorganisms between these two regions, helping to explain their evolutionary history.