Microbiome dataset of spontaneously fermented Ethiopian honey wine, Tej

This dataset contains raw and analyzed microbial data for samples of spontaneously fermented Ethiopian honey wine, Tej, collected from three locations in Ethiopia. It was generated using a culture-independent amplicon sequencing technique. To gain a better understanding of microbial community variance and similarity across Tej samples from the same and different locations, the raw sequence data obtained from the Illumina MiSeq sequencer were subjected to bioinformatics analysis. Lower diversity and richness of both bacterial and fungal communities were observed for all of the Tej samples. In addition, samples collected from the Debre Markos area showed significantly discriminating taxa for both bacterial and fungal communities. In a nutshell, this amplicon sequencing dataset provides a useful collection of data for modernizing this spontaneous fermentation into a directed, inoculated fermentation. A detailed discussion of the microbiome of the Tej samples is given in [1].


© 2022 The Author(s

Value of the Data
• Helps to identify the dominant bacterial and fungal genera found in Tej samples.
• Helps to understand the differences and similarities in microbial community structure among spontaneously fermented Tej samples.
• Helps in the development of a directed Tej fermentation system.

Data
This dataset contains microbiome data for both the bacterial and fungal communities of Tej samples collected from three different locations in Ethiopia. The raw bacterial and fungal FASTA files of each sample are accessible via the National Center for Biotechnology Information (NCBI) data repository. These FASTA files are the original data used for the bioinformatics analysis in this study. Table 1 describes the alpha diversity indices (Chao1, Shannon, Simpson, Evenness, InvSimpson, and observed) of each sample. This table is intended to show the differences in alpha diversity indices between sample collection areas. Table 2 lists the bacterial and fungal taxa with less than 1% relative abundance. It gives every level of taxonomic classification (Phylum, Class, Order, Family, and Genus) together with the relative abundance for both bacterial and fungal communities. Both tables are accessible in the Science Data Bank data repository. Furthermore, the quantitative bacterial and fungal beta diversity of the collected Tej samples is illustrated with a weighted-UniFrac principal coordinate analysis (PCoA) plot (Fig. 1). The relative abundance of each taxon in the bacterial and fungal communities from the respective sample collection areas was the main factor compared in the microbial diversity analysis. The distances in the weighted-UniFrac PCoA plot show differences in microbial taxon abundance between the collected Tej samples (Fig. 1). Moreover, Fig. 2 shows the linear discriminant analysis effect size (LEfSe) of bacteria and fungi for the collected Tej samples, grouped by sample collection area. This figure describes the bacterial and fungal taxa that are significantly more abundant in each group of samples. All taxa shown in Fig. 2 were selected using a linear discriminant analysis (LDA) score threshold of greater than 3.0.
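The alpha diversity indices listed for Table 1 can be illustrated with a minimal sketch. The read counts below are hypothetical and are not taken from the dataset. The sketch assumes the usual definitions (Gini-Simpson for Simpson, its reciprocal form for InvSimpson, Pielou's index for Evenness, and the bias-corrected Chao1 estimator); the source does not state which variants were used, so these choices are assumptions.

```python
import math


def alpha_diversity(counts):
    """Return common alpha diversity indices for one sample.

    counts: per-taxon read counts (hypothetical values, not from the dataset).
    """
    counts = [c for c in counts if c > 0]
    if not counts:
        raise ValueError("sample has no reads")
    total = sum(counts)
    props = [c / total for c in counts]

    observed = len(counts)                       # number of taxa seen
    shannon = -sum(p * math.log(p) for p in props)
    dominance = sum(p * p for p in props)        # Simpson's D
    simpson = 1.0 - dominance                    # Gini-Simpson form (assumption)
    inv_simpson = 1.0 / dominance
    # Pielou's evenness (assumed definition of "Evenness")
    evenness = shannon / math.log(observed) if observed > 1 else 0.0

    singletons = sum(1 for c in counts if c == 1)
    doubletons = sum(1 for c in counts if c == 2)
    # Bias-corrected Chao1 richness estimator (assumed variant)
    chao1 = observed + singletons * (singletons - 1) / (2 * (doubletons + 1))

    return {
        "observed": observed,
        "chao1": chao1,
        "shannon": shannon,
        "simpson": simpson,
        "inv_simpson": inv_simpson,
        "evenness": evenness,
    }


if __name__ == "__main__":
    hypothetical_sample = [50, 30, 10, 5, 2, 1, 1, 1]
    for name, value in alpha_diversity(hypothetical_sample).items():
        print(f"{name}: {value:.4f}")
```

A sample dominated by one or two taxa gives a low Shannon index and an InvSimpson value close to 1, which is the pattern the phrase "lower diversity" refers to.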

Sample collection, transportation and storage
Twenty-one fully matured Tej samples were collected from Addis Ababa (lat. 8.9806, long. 38.7578), Bahir Dar (lat. 11.5742, long. 37.3614), and Debre Markos (lat. 10.3296, long. 37.7344), Ethiopia. The samples were collected from local alcohol vendors, who were selected randomly based on their willingness to sell. All of the samples were collected aseptically using sterile screw-cap containers. Samples from the same location were collected on the same day. The collected samples were then transported to Kyungpook National University, Korea, in an insulated ice box with a freezing pack. Samples that required further analysis were stored in a freezer at -20 °C.

DNA extraction
About 40 mL of each Tej sample was centrifuged at 3200 rpm for 20 min to harvest the highest cell concentration. The microbial DNA was then extracted from the sediment using the QIAamp PowerSoil Pro Kit (QIAGEN, Germany) following the manufacturer's protocol. The final concentration of the extracted microbial DNA was checked with a Qubit 2.0 Fluorometer (Life Technologies, USA).

16S rRNA sequencing
Amplicon sequencing for each sample was performed using a barcode set of the Nextera Library Preparation Kit (Illumina Inc., USA). The hypervariable V4-V5 region of the 16S rRNA gene was PCR amplified using 515F (GTGNCAGCMGCCGCGGTAA) as the forward inner primer and 907R (CCGYCAATTYMTTTRAGTTT) as the reverse inner primer [2]. PCR amplification was performed in two phases on a thermocycler (Mastercycler Nexus GSX1, Eppendorf, Germany). The first PCR was run with pre-denaturation at 95 °C for 5 min, followed by 15 cycles of denaturation at 95 °C for 30 s, annealing at 60 °C for 30 s, and extension at 72 °C for 30 s, and a final extension at 72 °C for 5 min [3]. The reaction mixture was composed of 1 μL (1 μM) of reverse inner primer, 1 μL (1 μM) of forward inner primer, 2 μL of DNA template, and 25 μL of EmeraldAmp PCR Master Mix (Takara Co., Ltd., Japan). The total volume of the PCR reaction mixture was then adjusted to 50 μL with sterilized distilled water (SDW). The second PCR was conducted under the same running conditions as the first, with the addition of barcode primers and 2 μL of the first PCR product as the DNA template. The PCR products were then pooled into a single product at 100 ng/μL based on the measured DNA concentration. Finally, amplified and barcoded DNA fragments of 550 bp were selected using AMPure XP for PCR Purification (Beckman Coulter Inc., USA) for downstream procedures.
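The 515F and 907R primers contain IUPAC degenerate bases (N, M, Y, R), so each primer sequence stands for a small pool of concrete oligonucleotides. The following is a minimal sketch of how that pool can be counted and enumerated from the sequences given above. The function names are hypothetical and chosen for illustration; the code mapping is the standard IUPAC nucleotide table.

```python
from itertools import product

# Standard IUPAC nucleotide codes and the bases each one represents.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

# Primer sequences as given in the text.
PRIMER_515F = "GTGNCAGCMGCCGCGGTAA"
PRIMER_907R = "CCGYCAATTYMTTTRAGTTT"


def degeneracy(primer):
    """Number of distinct sequences represented by a degenerate primer."""
    count = 1
    for base in primer:
        count *= len(IUPAC[base])
    return count


def expand(primer):
    """List every concrete sequence represented by a degenerate primer."""
    return ["".join(bases) for bases in product(*(IUPAC[b] for b in primer))]


if __name__ == "__main__":
    for name, seq in (("515F", PRIMER_515F), ("907R", PRIMER_907R)):
        print(name, "length:", len(seq), "degeneracy:", degeneracy(seq))
```

Enumerating the variants with `expand` is practical here because the degeneracy is small; for heavily degenerate primers, counting with `degeneracy` alone avoids building a large list.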

Internal transcribed spacer (ITS) sequencing
Fungal internal transcribed spacer 2 (ITS2) regions were targeted for amplification using the primers ITS86F (GTGAATCATCGAATCTTTGAA) and ITS4 (TCCTCCGCTTATTGATATGC) [4,5]. The first PCR amplification was performed at 95 °C for 5 min, followed by 30 cycles of 95 °C for 30 s, 58 °C for 30 s, and 72 °C for 30 s, and finally 72 °C for 5 min (Jung et al., 2020). The second amplification was carried out under the same conditions as the first. The reaction mixtures for both PCR amplifications were composed of 1 μL (1 μM) of reverse primer, 1 μL (1 μM) of forward primer, 2 μL of DNA template, 25 μL of EmeraldAmp PCR Master Mix, and 21 μL of sterilized distilled water (SDW).
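The reaction volumes and cycling conditions stated for the 16S rRNA and ITS amplifications can be checked with simple arithmetic. The sketch below is only a bookkeeping aid with hypothetical class and function names. It sums the programmed hold times and ignores temperature ramping, so the actual run time on a thermocycler will be longer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PcrProgram:
    """Hold times of a three-step PCR program, in seconds."""
    pre_denaturation_s: int
    cycles: int
    denaturation_s: int
    annealing_s: int
    extension_s: int
    final_extension_s: int

    def hold_time_s(self):
        """Total programmed hold time, excluding ramping between steps."""
        per_cycle = self.denaturation_s + self.annealing_s + self.extension_s
        return (self.pre_denaturation_s
                + self.cycles * per_cycle
                + self.final_extension_s)


def water_to_add_ul(components_ul, final_volume_ul=50.0):
    """Volume of water needed to bring the mix to the final volume."""
    used = sum(components_ul.values())
    if used > final_volume_ul:
        raise ValueError("components exceed the final reaction volume")
    return final_volume_ul - used


# Values as stated in the text for the first-round reactions.
REACTION_MIX_UL = {
    "reverse_primer": 1.0,
    "forward_primer": 1.0,
    "dna_template": 2.0,
    "master_mix": 25.0,
}
PROGRAM_16S = PcrProgram(300, 15, 30, 30, 30, 300)
PROGRAM_ITS = PcrProgram(300, 30, 30, 30, 30, 300)

if __name__ == "__main__":
    print("Water to add (uL):", water_to_add_ul(REACTION_MIX_UL))
    print("16S hold time (min):", PROGRAM_16S.hold_time_s() / 60)
    print("ITS hold time (min):", PROGRAM_ITS.hold_time_s() / 60)
```

The computed water volume agrees with the 21 μL of SDW listed for the ITS reaction and with the 50 μL final volume stated for the 16S rRNA reaction.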

High-throughput sequencing
Before high-throughput sequencing, the size, quality, and quantity of the amplicon libraries were double-checked with an Agilent 2100 Bioanalyzer (Agilent Technologies Inc., USA). The amplicon libraries were then sequenced directly on the Illumina MiSeq platform following the manufacturer's instructions. Base calling and image analysis were performed using MiSeq Control Software (MCS), which is installed on the Illumina MiSeq instrument.

Bioinformatics and statistical analysis
Quantitative Insights Into Microbial Ecology 2 (QIIME 2) was used for the analysis of the raw sequence FASTQ data. Filtering, trimming, and denoising of the raw sequences were performed with DADA2 to obtain amplicon sequence variants (ASVs) [6]. For taxonomic identification of the bacterial and fungal communities, the SILVA and UNITE reference databases were used, respectively. The vegan package was used for alpha diversity analysis (Shannon, Chao1, Simpson, Evenness, and InvSimpson). The linear discriminant analysis effect size (LEfSe) and principal coordinate analysis (PCoA) plots were produced with the web-based Calypso tool and RStudio 4.0.3. All of these microbiome analyses applied the non-parametric Kruskal-Wallis test, with a p-value of less than 0.05 considered significant, to detect differences in microbiome features between the groups of collected samples.
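The Kruskal-Wallis comparison between the three collection areas can be sketched in plain Python. This is an illustrative re-implementation under stated assumptions, not the code path used by Calypso or R. The Shannon values are hypothetical, the function names are invented for the example, and the p-value uses the chi-square approximation, which for three groups (two degrees of freedom) reduces to the closed form exp(-H/2) and is only approximate for small samples.

```python
import math


def kruskal_wallis(*groups):
    """Return (H, degrees_of_freedom) with the standard tie correction."""
    if len(groups) < 2 or any(len(g) == 0 for g in groups):
        raise ValueError("need at least two non-empty groups")
    pooled = sorted(v for g in groups for v in g)
    n = len(pooled)

    # Assign average ranks to tied values and accumulate the tie term.
    rank_of = {}
    tie_term = 0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j] == pooled[i]:
            j += 1
        ties = j - i
        rank_of[pooled[i]] = (i + 1 + j) / 2.0  # mean of ranks i+1 .. j
        tie_term += ties ** 3 - ties
        i = j

    rank_sum_term = sum(
        sum(rank_of[v] for v in g) ** 2 / len(g) for g in groups
    )
    h = 12.0 / (n * (n + 1)) * rank_sum_term - 3.0 * (n + 1)

    correction = 1.0 - tie_term / (n ** 3 - n)
    if correction == 0.0:
        raise ValueError("all values are identical")
    return h / correction, len(groups) - 1


def p_value_three_groups(h):
    """Chi-square upper tail for 2 degrees of freedom: exp(-h / 2)."""
    return math.exp(-h / 2.0)


if __name__ == "__main__":
    # Hypothetical Shannon indices for three collection areas.
    addis_ababa = [1.1, 1.2, 1.3]
    bahir_dar = [1.4, 1.5, 1.6]
    debre_markos = [1.7, 1.8, 1.9]
    h, df = kruskal_wallis(addis_ababa, bahir_dar, debre_markos)
    p = p_value_three_groups(h)
    print(f"H = {h:.3f}, df = {df}, p = {p:.4f}, significant = {p < 0.05}")
```

Because the test works on ranks rather than raw values, it does not assume normally distributed diversity indices, which is why a non-parametric test suits small per-area sample sizes.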

Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data Availability
Alpha diversity and Microbial community tables (Original data) (Science Data Bank).