Metagenomic data from gutter water in the city of Pointe-Noire, Republic of Congo

After Amazonia, the Congo Basin is the second-largest tropical rainforest area in the world. This basin harbours remarkable biodiversity, yet the microbiological diversity of its waters, soils, and populations remains largely unexplored. While many initiatives to characterize global biodiversity are under way, few are conducted in Africa and none specifically concerns urban areas of the Congo Basin. In this context, we assessed the microbial diversity present in gutter water in the city of Pointe-Noire, Congo. This city is of particular interest because of its high population density and its location between the Atlantic Ocean and the Mayombe forest in Central Africa. The findings illuminate the microbial composition of surface water in Pointe-Noire. The dataset allows the identification of putative new bacteria through the assembly of 81 metagenome-assembled genomes. It also serves as a valuable primary resource for assessing the presence of antibiotic-resistance genes, offering a useful tool for public health authorities to monitor risks.


Value of the Data
• The dataset provides the first insights into the microbial diversity of gutter water from the city of Pointe-Noire, Republic of Congo.
• The discovery of pathogenic microorganisms could help local authorities anticipate the emergence of epidemics.
• The dataset allows the identification of new metagenome-assembled genomes (MAGs) that are of interest to environmental microbiologists.
• The dataset serves as a valuable primary resource for assessing the presence of antibiotic-resistance genes, offering a useful tool for public health authorities to monitor risks, as already done in Kenya, Uganda, and Tanzania.

Background
The city prevents flood damage by digging gutters to drain excess water. People sometimes use these gutters to discharge various wastes, including domestic wastewater. We therefore selected one gutter point containing a mix of rainwater and waste to perform a preliminary study of the microbial composition, useful both for environmental microbial ecology and for public health authorities (Fig. 1). The results of such a project could convince public health authorities to extend the current analysis to different seasons or areas in the city of Pointe-Noire.

Data Description
The dataset is based on raw Illumina paired-end reads obtained through shotgun metagenomic sequencing of DNA isolated from gutter water collected in the city of Pointe-Noire. The raw data contain 84,886,827 read pairs (2 × 150 bp; 25,466 Mbases). The raw data used in this analysis and the associated data analyses are available under NCBI BioProject No. PRJNA1021800.
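The reported yield can be checked with simple arithmetic. The minimal sketch below assumes that 84,886,827 is the number of read pairs (as the "2 ×" notation in the sequencing section suggests) and that every read is exactly 150 bp long; the variable names are illustrative only.

```python
# Figures quoted in the data description.
READ_PAIRS = 84_886_827   # assumption: this count refers to read pairs
READS_PER_PAIR = 2        # paired-end sequencing
READ_LENGTH_BP = 150      # assumption: all reads have the full length

total_reads = READ_PAIRS * READS_PER_PAIR
total_bases = total_reads * READ_LENGTH_BP
total_mbases = total_bases // 1_000_000  # truncate to whole megabases

print(f"{total_reads:,} reads, {total_bases:,} bases ({total_mbases:,} Mbases)")
```

Under these assumptions the total agrees with the 25,466 Mbases stated above.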
Regarding the taxonomic distribution, the Kaiju profiler identified Bacteria, viruses, Archaea, and Eukaryota. The taxonomy of the identified organisms is provided in supplementary tables S1, S2, and S3. Unclassified reads were analysed with a second profiler (Kraken2) to extract the maximum information from the data. However, although some reads were assigned to bacteria, most of them remained unassigned (supplementary table S4). Furthermore, de novo assembly of the whole dataset allowed the identification of 81 metagenome-assembled genomes (Fig. 2, supplementary table S5), whose taxonomy is described in Table 1.
Table 1. List and taxonomy of MAGs.
Our study also included a screening of antibiotic-resistance genes in the whole assembly, with 27 antibiotic-resistance genes identified and listed in Table 2.
Table 2. Identification and characterization of antibiotic-resistance genes.
Fig. 2. Phylogeny of identified bins (MAGs) using the FastTree software on the multi-alignment files generated by the pipeline. These MAGs belong to several phyla: Actinobacteriota (red); Bacteroidota (aqua); Bdellovibrionota (pink); Hydrogenedentota (gray); Myxococcota (brown); Planctomycetota (green); Proteobacteria (orange) and Verrucomicrobiota (blue). (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

Sample collection
For this preliminary study, one liter of water was sampled once, on September 19th, 2019, from a gutter running alongside houses in the city of Pointe-Noire (4°48′43.1″S, 11°52′27.8″E) in the Republic of Congo. The water was transported in a bottle with an ice pack and stored at 4 °C for six days until DNA extraction.
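For use in mapping software, the sampling coordinates can be converted from degrees, minutes, and seconds to decimal degrees. The following is a minimal sketch; the function name `dms_to_decimal` is illustrative, and the minute and second values are read from the coordinates given above.

```python
def dms_to_decimal(degrees, minutes, seconds, hemisphere):
    """Convert degrees/minutes/seconds to signed decimal degrees.

    Southern and western hemispheres are returned as negative values.
    """
    value = degrees + minutes / 60 + seconds / 3600
    return -value if hemisphere in ("S", "W") else value


latitude = dms_to_decimal(4, 48, 43.1, "S")
longitude = dms_to_decimal(11, 52, 27.8, "E")
print(f"latitude={latitude:.6f}, longitude={longitude:.6f}")
```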

DNA isolation, library preparation, and shotgun sequencing
Four hundred milliliters of water were successively filtered through two 5 μm filters, one 1.2 μm filter, and three 0.22 μm filters, all of mixed cellulose esters (MCE membrane; 47 mm; MF-Millipore). DNA was extracted from the six filters and pooled using the DNeasy PowerWater kit (Qiagen) according to the manufacturer's instructions. Library preparation and sequencing were performed by GENEWIZ (Azenta Life Sciences). The DNA library was sequenced on an Illumina HiSeq 4000 machine in paired-end mode, producing reads of 150 base pairs. The raw data contain 2 × 84,886,827 reads of 150 bp (25,466 Mbases).

Metagenomics profiling
Two profilers were used: Kaiju [4] and Kraken2 [5]. The latter was used on reads that were not classified by the first profiler in order to minimize false positives. We used three available databases for Kaiju: nr_euk (version 2022-03-10), a database similar to NR (the NCBI non-redundant protein database) that additionally includes fungi and microbial eukaryotes; rvdb (version 2022-04-07), the Reference Viral Database [6]; and a plasmids database (version 2022-04-10). All these databases are available on the Kaiju homepage. All the classifications were merged into one file at the end of the analyses.
The databases were queried sequentially: reads that were not classified against the first database were searched against the second, and so on. At each step, the seqtk tool v1.3-r106 (https://github.com/lh3/seqtk) extracted the unassigned reads and prepared the input for the next step.
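The hand-over between two classification steps can be sketched as follows. This is an illustration under stated assumptions, not the script used in this study: it assumes Kaiju's default tab-separated output, in which the first column is `C` (classified) or `U` (unclassified) and the second column is the read name; the function name and the toy records are hypothetical.

```python
import io


def unclassified_read_ids(kaiju_lines):
    """Return the names of reads flagged 'U' (unclassified) in Kaiju output."""
    ids = []
    for line in kaiju_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2 and fields[0] == "U":
            ids.append(fields[1])
    return ids


# Toy Kaiju-style records: status, read name, NCBI taxon ID (made-up values).
example = io.StringIO(
    "C\tread_1\t562\n"
    "U\tread_2\t0\n"
    "C\tread_3\t1280\n"
    "U\tread_4\t0\n"
)
to_next_database = unclassified_read_ids(example)
print(to_next_database)
```

The resulting names could then be written to a list file and passed to `seqtk subseq` together with the FASTQ file to build the input for the next database.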
All reads left unclassified by Kaiju were used as input for profiling with Kraken2, a nucleotide-based classifier that relies on k-mer similarity. The database used with Kraken2 was PlusPF, which comprises the standard Kraken2 database (RefSeq archaea, bacteria, viral, plasmid, human, UniVec_Core) along with RefSeq protozoa and fungi. The indexes of this database and others are freely available at https://benlangmead.github.io/aws-indexes/k2 .

Metagenome assembled genomes (MAGs) reconstruction, taxonomic assignment, and functional annotation
Filtered and decontaminated reads were assembled with MEGAHIT [7] using k-mer values ranging from 21 to 127 with a step of 2. Contigs shorter than 200 base pairs were discarded. In the same way and with the same k-mer parameters, the metaSPAdes assembler [8] was used to generate a second assembly. Two binners, MaxBin2 [9] and MetaBAT2 [10], were applied to each assembly (the so-called MEGAHIT assembly and the metaSPAdes assembly). Both are widely used for binning and are routinely integrated into metagenomic data analysis pipelines. They work in much the same way but with different sensitivities. The only notable difference between the two tools is the minimum contig length accepted by the binner: MetaBAT2 filters out all contigs shorter than 1500 base pairs and does not bin them.
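The assembly parameters described above can be sketched as follows. This is a hedged illustration rather than the exact commands that were run: the file and directory names are hypothetical, the flag names are given as we understand them from each assembler's command-line help, and the commands are only built as lists, not executed.

```python
K_MIN, K_MAX, K_STEP = 21, 127, 2
k_values = list(range(K_MIN, K_MAX + 1, K_STEP))  # 21, 23, ..., 127

# Hypothetical input and output names.
R1, R2 = "reads_R1.fastq.gz", "reads_R2.fastq.gz"

# MEGAHIT takes the k-mer range as minimum, maximum, and step.
megahit_cmd = [
    "megahit", "-1", R1, "-2", R2,
    "--k-min", str(K_MIN), "--k-max", str(K_MAX), "--k-step", str(K_STEP),
    "--min-contig-len", "200",
    "-o", "megahit_assembly",
]

# metaSPAdes takes an explicit comma-separated list of odd k-mer sizes.
metaspades_cmd = [
    "spades.py", "--meta", "-1", R1, "-2", R2,
    "-k", ",".join(str(k) for k in k_values),
    "-o", "metaspades_assembly",
]


def binnable_contigs(contig_lengths, min_length=1500):
    """Keep contigs that meet a binner's minimum length (1500 bp for MetaBAT2)."""
    return {name: n for name, n in contig_lengths.items() if n >= min_length}


print(len(k_values), "k-mer sizes from", k_values[0], "to", k_values[-1])
print(binnable_contigs({"contig_a": 250, "contig_b": 1500, "contig_c": 9800}))
```

The length filter illustrates why the same assembly can yield different bins with the two binners: contigs between 200 and 1499 bp survive assembly but are never considered by MetaBAT2.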
Using two assemblers and two binners thus resulted in four binned assemblies. The goal was to refine these assemblies to extract the maximum information from our dataset, particularly given the limitation of having only one sample. Subsequently, four modules from the MetaWRAP v1.1.2 [11] pipeline were applied to the bins. Bin refinement, which includes dereplication, was performed with the MetaWRAP bin refinement module. This module combines bins to create hybrid bins after evaluating the quality of each bin with CheckM. It then removes duplicate contigs appearing in multiple bins to ultimately identify the best version of each bin. For taxonomic assignment, we also used GTDB-Tk v1.4.1 [12–16], running the whole pipeline to place contigs/bins in the GTDB reference tree. The contigs of each bin were annotated using the Prokka annotation pipeline v1.12 [17].

Screening of antibiotic resistance genes
MMseqs2 (release 14-7e284) [18] was employed to cluster the two assemblies (MEGAHIT and metaSPAdes). The clustered assembly was then searched for antimicrobial resistance genes using ABRicate (release 12), which can be found at https://github.com/tseemann/abricate .
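A table such as Table 2 can be derived from ABRicate's tab-separated report. The sketch below is illustrative only: it assumes a report with the column names `GENE`, `%COVERAGE`, `%IDENTITY`, `ACCESSION`, `PRODUCT`, and `RESISTANCE` (a subset of the columns we understand recent ABRicate releases to emit), the identity and coverage thresholds are hypothetical parameters, and the rows are made-up placeholders rather than results from this dataset.

```python
import csv
import io

HEADER = ["#FILE", "SEQUENCE", "GENE", "%COVERAGE", "%IDENTITY",
          "ACCESSION", "PRODUCT", "RESISTANCE"]
ROWS = [  # placeholder hits, not real results
    ["assembly.fa", "contig_1", "geneA", "100.00", "99.50", "ACC0001", "product A", "drug_class_1"],
    ["assembly.fa", "contig_2", "geneA", "98.00", "97.10", "ACC0001", "product A", "drug_class_1"],
    ["assembly.fa", "contig_3", "geneB", "60.00", "91.00", "ACC0002", "product B", "drug_class_2"],
]
REPORT = "\n".join("\t".join(fields) for fields in [HEADER] + ROWS) + "\n"


def distinct_resistance_genes(report_text, min_identity=80.0, min_coverage=80.0):
    """Map each gene passing both thresholds to (accession, product, resistance)."""
    reader = csv.DictReader(io.StringIO(report_text), delimiter="\t")
    genes = {}
    for row in reader:
        if (float(row["%IDENTITY"]) >= min_identity
                and float(row["%COVERAGE"]) >= min_coverage):
            # Keep the first hit per gene; later contigs with the same gene are duplicates.
            genes.setdefault(
                row["GENE"], (row["ACCESSION"], row["PRODUCT"], row["RESISTANCE"])
            )
    return genes


hits = distinct_resistance_genes(REPORT)
print(len(hits), "distinct gene(s):", sorted(hits))
```

Counting distinct gene names, rather than rows, avoids reporting the same gene twice when it occurs on several contigs of the clustered assembly.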

Fig. 1. Geographical overview and location of the sampling site in Pointe-Noire. (a) The sampling was performed within the red circle area; the map was obtained from the website https://www.mapnall.com/en/ . (b) Photo of the gutter where water was sampled. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

Table 1
Taxonomic categories of the 81 MAGs identified.

Table 2
Accession number, gene functions and antibiotic resistance.