Whole genome sequencing data of Streptococcus pneumoniae isolated from Indonesian population

Streptococcus pneumoniae is the leading cause of bacterial pneumonia, bacteremia, and meningitis. Indonesia introduced the pneumococcal conjugate vaccine (PCV) nationwide in 2022. In this study, we present whole genome sequence (WGS) data of 94 S. pneumoniae isolates obtained from hospitalized patients, healthy children, and adult groups in different regions of Indonesia prior to the PCV program. Sequencing libraries were prepared with the TruSeq Nano DNA kit and sequenced on the Illumina NovaSeq 6000 platform. The assembled genomes comprise a 1,969,562 bp to 2,741,371 bp circular chromosome with 39–40% G+C content, and contain 1935–3319 coding sequences (CDS), 2 to 5 rRNA genes, 43 to 49 tRNA genes, and 56 to 71 ncRNAs. These data will be useful for analyzing the serotype, sequence type, virulence genes, antimicrobial resistance genes, and the impact of pneumococcal vaccination in Indonesia. The raw FASTQ files of these sequences are available under BioProject accession number PRJNA995903 and Sequence Read Archive accession numbers SRR25316461-SRR25316554.
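For readers who want to script retrieval of the raw reads, the Sequence Read Archive range can be expanded into individual run accessions. The following is a minimal sketch that assumes the range is contiguous and inclusive; the helper name `sra_accessions` is hypothetical and not part of any SRA tool.

```python
def sra_accessions(first: str, last: str) -> list[str]:
    """Expand an inclusive run range such as SRR25316461-SRR25316554.

    Assumes both accessions share a three-letter prefix (e.g. "SRR")
    and that every number in between is a valid run (an assumption).
    """
    prefix = first[:3]
    if last[:3] != prefix:
        raise ValueError("accessions must share the same prefix")
    start, stop = int(first[3:]), int(last[3:])
    width = len(first) - 3  # keep the original digit width
    return [f"{prefix}{n:0{width}d}" for n in range(start, stop + 1)]


runs = sra_accessions("SRR25316461", "SRR25316554")
print(len(runs), runs[0], runs[-1])
```

Under this assumption the range expands to 94 runs, which matches the 94 isolates described above.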



Value of the Data
• The data can support comparative genomic studies of S. pneumoniae isolated from patients and healthy people in different regions of Indonesia.
• The data provide insight into mechanisms of antibiotic resistance and can be used to predict antimicrobial resistance profiles to inform drug development and disease treatment.
• The data provide a baseline for assessing the impact of pneumococcal vaccination in Indonesia.

Background
Streptococcus pneumoniae (S. pneumoniae) is a Gram-positive, lancet-shaped diplococcus. It is a fastidious, facultatively anaerobic bacterium. It typically resides in the nasopharynx as part of the normal flora but can invade normally sterile body sites, giving rise to infections including meningitis, bacteremia, and pneumonia [1]. The percentage of deaths caused by pneumonia in children under the age of five in Indonesia was 4% in 2021 [2].
Comprehensive whole-genome sequencing of S. pneumoniae has not previously been conducted in Indonesia, leaving a critical gap in our understanding of its genetic diversity and the implications for public health. Whole-genome data can help reconstruct the evolutionary history of the bacterium, pinpoint virulence factors, and identify antibiotic-resistance genes, all of which are crucial for tailoring effective strategies to combat infections. This data-driven approach can enhance disease surveillance, facilitate the tracking of transmission routes, and aid in the selection of appropriate treatment regimens, ultimately contributing to more precise and timely interventions against S. pneumoniae infections in Indonesia [3]. Furthermore, monitoring and evaluating data during the vaccination and post-vaccination eras is essential for assessing vaccine effectiveness, detecting changes in disease trends, improving vaccine coverage, and guiding future vaccine development [4].
Quality control showed that all samples were of good quality, with a mean Q score above 30 (Fig. 2). GC content was similar across the 94 samples, ranging from 39% to 40% (Fig. 3). Whole genome sequencing generated 19,515,568 to 29,732,816 reads per sample (Fig. 4). The assemblies showed that most samples had a genome size of around 2,000,000 bp (2 Mb); two samples had larger genomes (2.4 Mb and 2.75 Mb) (Fig. 5a). The number of genes ranged from 2046 to 3441 and was similar across most samples (Fig. 5b). The number of coding sequences (CDS) ranged from 1935 to 3319 (Fig. 5c), and the number of hypothetical proteins from 75 to 234 (Fig. 5d). Taxonomic classification using k-mer/ANI and 16S rRNA identified all samples as S. pneumoniae (TaxID 1313).

Bacterial culture
All S. pneumoniae isolates were preserved in skim milk-tryptone-glucose-glycerol (STGG) medium at -80 °C. Isolates were streaked onto 5% sheep blood agar and incubated overnight at 37 °C in 5% CO2. Plates were examined for alpha-hemolytic colonies, and isolates were identified by their susceptibility to optochin (Fig. 1).

DNA extraction and library preparation
Genomic DNA was extracted from S. pneumoniae isolates using the DNeasy Blood & Tissue Kit (Qiagen, 69606) after pre-treatment with mutanolysin and lysozyme. The S. pneumoniae genomes were sequenced at PT Indolab Utama, Jakarta, Indonesia (Macrogen Co., Ltd., Singapore) using the Illumina platform. Library preparation was performed using the TruSeq Nano DNA Kit (Illumina, NE, USA) according to the manufacturer's instructions.

Fig. 2. Whole genome sequencing quality control results. The Q score measures the quality of the base calls in a sequencing read; a higher Q score indicates a more reliable base call. The Q score plot generated by FastQC shows the distribution of Q scores across all reads, with the x-axis representing the position in the read and the y-axis representing the Q score. The green zone indicates good quality scores, the yellow zone moderate quality scores, and the red zone poor quality scores.
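The quantity plotted in Fig. 2 can be illustrated with a short sketch that averages Q scores at each read position. This is not the FastQC implementation. It assumes Phred+33 quality encoding (not stated in the source), and the zone cut-offs of 28 and 20 are illustrative assumptions that should be checked against the FastQC documentation. The function names are hypothetical.

```python
def per_position_mean_quality(quality_lines, offset=33):
    """Mean Q score at each read position (x-axis of the plot).

    quality_lines: iterable of FASTQ quality strings.
    offset: ASCII offset of the encoding (Phred+33 assumed here).
    Reads may differ in length; each position is averaged over
    the reads that reach it.
    """
    totals, counts = [], []
    for line in quality_lines:
        for i, ch in enumerate(line):
            if i == len(totals):
                totals.append(0)
                counts.append(0)
            totals[i] += ord(ch) - offset
            counts[i] += 1
    return [t / c for t, c in zip(totals, counts)]


def quality_zone(q, good=28.0, poor=20.0):
    """Map a Q score to a plot zone; cut-offs are assumed defaults."""
    if q >= good:
        return "green"
    if q >= poor:
        return "yellow"
    return "red"


# Toy quality strings: 'I' = Q40, '?' = Q30, '5' = Q20, '+' = Q10
means = per_position_mean_quality(["II5+", "I?5"])
zones = [quality_zone(q) for q in means]
print(means, zones)
```

A Q score of 30 corresponds to an estimated base-call error probability of 10^(-30/10) = 0.001, so a mean Q score above 30 implies fewer than one expected error per 1000 bases.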

Fig. 3. Percentage of GC content in each sample. The plot shows the distribution of GC content across reads, with the x-axis representing the GC content and the y-axis representing the number of sequences. The central peak corresponds to the overall GC content of the underlying genome and should match the expected %GC for the organism. The distribution should be approximately normal; sharp peaks on an otherwise normal distribution suggest over-represented sequences, and a broad peak suggests contamination with another organism.
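The distribution described in Fig. 3 can be approximated by computing the GC percentage of each read and counting reads per integer percentage. The sketch below is illustrative only and is not how FastQC builds its plot internally; the function names are hypothetical, and ambiguous bases such as N are excluded from the denominator by assumption.

```python
from collections import Counter


def gc_percent(seq):
    """GC percentage of one read, ignoring non-ACGT characters."""
    seq = seq.upper()
    called = sum(seq.count(base) for base in "ACGT")
    if called == 0:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / called


def gc_histogram(reads):
    """Number of reads at each rounded GC percentage.

    Keys are GC% values (x-axis), counts are reads (y-axis).
    """
    return Counter(round(gc_percent(read)) for read in reads)


# Toy reads, not real data from this study
hist = gc_histogram(["GGCC", "AATT", "ATGC", "GCAT"])
print(sorted(hist.items()))
```

For a clean S. pneumoniae library, the peak of such a histogram would be expected near the 39 to 40% range reported for these samples.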