The whole genome sequence data analyses of a Mycobacterium tuberculosis strain SBH321 isolated in Sabah, Malaysia, belongs to Ural family of Lineage 4

In 2019, 10 million new cases of tuberculosis were reported worldwide. These data report genetic analyses of Mycobacterium tuberculosis strain SBH321, isolated from a 31-year-old female with pulmonary tuberculosis. The genomic DNA of the strain was extracted from pure culture and sequenced on an Illumina platform. The genome of M. tuberculosis strain SBH321 consists of 4,374,895 bp with a G+C content of 65.59%. SNP-based phylogenetic analysis using the maximum-likelihood method showed that the strain belongs to a sublineage of the Ural family of the Europe-America-Africa lineage (Lineage 4) and clusters with M. tuberculosis strain OFXR-4 from Taiwan. The whole genome sequence is deposited at DDBJ/ENA/GenBank under the accession WCJH00000000 (SRR10230353).



Value of the Data
• Since Mycobacterium tuberculosis of the Ural family has never been reported in Malaysia, the whole genome sequence of this strain could provide fundamental knowledge and insight towards understanding its microbial activities.
• The data could be used in the examination of the molecular characteristics and genetic variability of the M. tuberculosis strain, which would benefit molecular epidemiologists.
• The data, an important source towards understanding the relationship between M. tuberculosis strains from Sabah and other regions, could assist in developing informed policy in the design and implementation of tuberculosis control programmes.

Data Description
Mycobacterium tuberculosis is divided into seven lineages, of which Lineage 4 is highly diversified. This lineage contains several families of strains, the Ural family among them [1]. The Ural family, first reported in 2005, constituted 15% of the strains studied in the Middle Ural area of Russia [2,3]. Besides Russia, these strains are mainly found in Iran, Afghanistan, Pakistan, Turkey, Kyrgyzstan, Ukraine, Abkhazia, Kazakhstan, Georgia and Armenia [1]. Apart from only one report from northern India and north-eastern China, these strains have not yet been reported from any South, East or Southeast Asian country [1]. The Malaysian Borneo state of Sabah, located in Southeast Asia where tuberculosis cases are increasing, has one of the highest numbers of tuberculosis cases in Malaysia [4]. The exact reason for this high incidence of tuberculosis is unknown. To address this gap in the data, we undertook a project to perform whole genome sequence (WGS) analysis of M. tuberculosis strains from Sabah to determine their genetic characteristics.
This paper documents the analysis of the WGS data of M. tuberculosis strain SBH321 from Sabah, Malaysia. The strain was isolated from a 31-year-old Filipino female patient from Lahad Datu, Sabah, who was confirmed to be tuberculosis-positive by GeneXpert MTB/RIF. The strain was cultured using the BACTEC MGIT system, and WGS was performed on an Illumina HiSeq 4000 system. De novo assembly of the genome generated 114 contigs with an N50 of 193,257 bp. The genome size was 4,374,895 bp, with 4059 predicted genes and a G+C content of 65.59%. Comparative genomic analysis using the WGS of 78 strains revealed that strain SBH321 belongs to the Ural family of Lineage 4 and resembles strain OFXR-4 from Taiwan (Fig. 1).
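For readers less familiar with the assembly statistics quoted above, the following is a minimal sketch of how an N50 and a G+C percentage can be computed from a set of contigs. It is not the authors' code; the function names and the toy contigs are hypothetical and stand in for the real assembly.

```python
def n50(contig_lengths):
    """Return the N50: the length L such that contigs of length >= L
    together cover at least half of the total assembly length."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("no contigs given")


def gc_percent(sequences):
    """Return the percentage of G and C bases across all sequences."""
    gc = sum(s.upper().count("G") + s.upper().count("C") for s in sequences)
    total = sum(len(s) for s in sequences)
    return 100.0 * gc / total


# Toy contigs (hypothetical), not taken from the SBH321 assembly.
toy_contigs = ["GGCCGGCCAT", "GCGCAT", "ATGC", "GC"]
print(n50([len(c) for c in toy_contigs]))
print(round(gc_percent(toy_contigs), 2))
```

Applied to the contig sequences of a real assembly, the same two functions would yield figures comparable to the N50 and G+C content reported here.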

Bacterial culture and DNA extraction
M. tuberculosis strain SBH321 was isolated from the sputum of a patient with tuberculosis diagnosed by GeneXpert MTB/RIF. The strain was grown in Middlebrook 7H9 medium and incubated at 37 °C in a BACTEC MGIT 320 system (Becton-Dickinson, Oxford, United Kingdom). Genomic DNA was extracted using the MasterPure Complete DNA and RNA purification kit (Epicenter Inc., Madison, Wisconsin, USA) according to the manufacturer's instructions, except that the lysis step was extended to 16 h. The quality of the extracted DNA was determined with a NanoDrop 2000c spectrophotometer (ThermoFisher Scientific, USA).

Whole genome sequencing and bioinformatic analyses
Sequencing covered 99% of the genome at 386× coverage and generated a total of 11,355,058 paired reads from a 150-bp paired-end library prepared with the NEBNext Ultra kit (Illumina, San Diego, CA). The sequencing data were deposited in the Sequence Read Archive (SRA) under the BioSample accession number SAMN12878104 and the BioProject accession number PRJNA575111. SPAdes version 3.11.1 [5] was used for de novo assembly, and the NCBI Prokaryotic Genome Annotation Pipeline (PGAP) [6] was used to annotate the generated contigs.
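The reported depth can be sanity-checked with the usual estimate, depth = number of reads × read length / genome length. The sketch below is illustrative only and rests on two assumptions not stated in the text: that the 11,355,058 figure counts individual reads rather than read pairs, and that depth was computed against the H37Rv reference (NC_000962.3, 4,411,532 bp). Under those assumptions the estimate comes out at roughly 386×.

```python
def mean_depth(n_reads, read_length_bp, genome_length_bp):
    """Estimate mean sequencing depth as total sequenced bases / genome length."""
    return n_reads * read_length_bp / genome_length_bp


READS = 11_355_058        # reported read count (assumed to be individual reads)
READ_LEN = 150            # reported read length in bp
H37RV_LEN = 4_411_532     # assumption: length of reference NC_000962.3
ASSEMBLY_LEN = 4_374_895  # reported SBH321 assembly size

depth_vs_reference = mean_depth(READS, READ_LEN, H37RV_LEN)
depth_vs_assembly = mean_depth(READS, READ_LEN, ASSEMBLY_LEN)
print(f"{depth_vs_reference:.1f}x against the reference length")
print(f"{depth_vs_assembly:.1f}x against the assembly length")
```

If the read count instead refers to read pairs, both estimates would double, so the interpretation of "paired reads" matters when comparing depth between data sets.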

Variant calling
For the variant calling analysis, the raw sequence reads were first aligned to a reference genome, M. tuberculosis H37Rv (GenBank accession number NC_000962.3), using BWA MEM version 0.7.12 [7], producing alignments in SAM/BAM format. Samtools version 0.1.19 [8] was then used to convert the alignment format and sort the alignments. Next, the Genome Analysis Toolkit (GATK) version 3.4.0 [9] was used to perform local realignment of the sequence reads and to generate the variant calling reports for M. tuberculosis strain SBH321. SnpEff version 4.1 [10] was used to annotate single nucleotide polymorphisms (SNPs).
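Variant callers such as GATK typically write their calls in the tab-separated VCF format (CHROM, POS, ID, REF, ALT, ...). As a minimal sketch, assuming VCF output (the text does not name the report format), SNPs can be separated from insertions and deletions by keeping only records in which both the reference and alternate alleles are single bases. The function name and the toy records below are hypothetical.

```python
def iter_snps(vcf_lines):
    """Yield (chrom, pos, ref, alt) for single-base substitutions in VCF text."""
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        chrom, pos, _id, ref, alt = fields[:5]
        for allele in alt.split(","):  # ALT may list several alleles
            if len(ref) == 1 and len(allele) == 1 and ref in "ACGT" and allele in "ACGT":
                yield chrom, int(pos), ref, allele


# Toy VCF records with hypothetical positions, not real SBH321 calls.
toy_vcf = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "NC_000962.3\t100\t.\tA\tG\t50\tPASS\t.",
    "NC_000962.3\t250\t.\tCT\tC\t50\tPASS\t.",  # deletion, skipped
    "NC_000962.3\t400\t.\tG\tT,GA\t50\tPASS\t.",  # only the G>T allele is a SNP
]
snps = list(iter_snps(toy_vcf))
print(snps)
```

Quality and filter fields are ignored here for brevity; a real analysis would normally also apply quality thresholds before annotation.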

SNP-based phylogenetic genotype data of SBH321
Phylogenetic analysis of the entire SNP matrix was performed by the maximum-likelihood method using MEGA (Molecular Evolutionary Genetics Analysis) X [11] after aligning the nucleotide sequences using CLUSTALW [11]. The significance of the branching patterns was evaluated by bootstrap analysis with 1000 replicates. The whole genome sequences of 78 strains of M. tuberculosis were retrieved from GenBank and used in the phylogenetic analysis [12][13][14], which showed that our strain belongs to the Ural family of the Europe-America-Africa lineage (Lineage 4) and clusters with the ofloxacin-resistant M. tuberculosis strain OFXR-4 from Taiwan [15].
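To illustrate what an SNP matrix looks like as input for tree building, the sketch below concatenates the alleles of each strain at every position that varies in at least one strain, filling in the reference base where a strain has no call. This is one common way to build such a matrix and is an assumption, not a description of the authors' procedure; the maximum-likelihood inference itself (done here in MEGA X) is not reproduced. All names, positions and strains are hypothetical.

```python
def snp_alignment(reference_bases, strain_snps):
    """Build one concatenated SNP string per strain.

    reference_bases: dict mapping position -> reference base
    strain_snps: dict mapping strain name -> {position: alternate base}
    """
    positions = sorted({pos for snps in strain_snps.values() for pos in snps})
    alignment = {
        strain: "".join(snps.get(pos, reference_bases[pos]) for pos in positions)
        for strain, snps in strain_snps.items()
    }
    return positions, alignment


def pairwise_differences(seq_a, seq_b):
    """Count the positions at which two equal-length SNP strings differ."""
    return sum(a != b for a, b in zip(seq_a, seq_b))


# Toy data (hypothetical positions and strains).
reference = {10: "A", 25: "C", 40: "G"}
calls = {
    "reference": {},
    "strain_A": {10: "G", 40: "T"},
    "strain_B": {10: "G"},
    "strain_C": {25: "T"},
}
positions, matrix = snp_alignment(reference, calls)
for name, seq in matrix.items():
    print(name, seq)
print(pairwise_differences(matrix["strain_A"], matrix["strain_B"]))
```

Strains separated by few SNP differences, such as strain_A and strain_B in this toy example, would be expected to fall close together in the resulting tree.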

Nucleotide sequence accession number
The whole genome sequence has been deposited at DDBJ/ENA/GenBank under the accession number WCJH00000000.

Ethics Statement
This study was approved by the Ethics Committee of the Faculty of Medicine and Health Sciences, Universiti Malaysia Sabah [JKEtika 2/16 (6)].

Declaration of Competing Interest
No competing interest.