Bacillus subtilis promoter sequences data set for promoter prediction in Gram-positive bacteria

This paper presents a prediction of Bacillus subtilis promoters using a Support Vector Machine system. The literature contains far less information on promoter sequences of Gram-positive bacteria than on those of Gram-negative bacteria, even though promoter sequence identification is essential for studying gene expression. Initially, we collected the B. subtilis genome sequence from the NCBI database, and promoters were identified by their sigma factors in the DBTBS database. We then grouped the promoters according to 15 sigma factors in 2 domains, corresponding to sigma 54 and sigma 70 of Gram-negative bacteria. Based on these data, we developed a Python script to search for promoters in the B. subtilis genome. After processing the data, we obtained 767 promoter sequences for B. subtilis, most of which are recognized by the sigma factor SigA. To validate the data, we developed a software package called BacSVM+, which receives promoters as input and returns the best combination of LibSVM parameters for predicting promoter regions in the bacterium used in the simulation. All data gathered, as well as the BacSVM+ software, are available for download at http://bacpp.bioinfoucs.com/rafael/Sigmas.zip.


Value of the data
The data obtained can be used in further studies on the regulation of gene expression, which is essential for bacterial metabolic adaptation to environmental changes and allows bacterial survival and multiplication.
Most related papers on bacterial promoters are restricted to Gram-negative bacteria, particularly E. coli. The promoters of B. subtilis described in this paper allow further research in this area.
Data on Gram-positive bacteria promoters in the literature are scarce. The process described here can be used by researchers to validate promoters in other bacteria of this type.

Data
Transcription of a coding region starts when the RNA polymerase (RNAp) enzyme recognizes the promoter region. Promoter regions are conserved DNA sequences that signal and direct the transcription of an adjacent gene or group of genes. Promoters are considered key factors for transcription as they are the initial step in gene expression and part of transcriptional regulation [13]. For this to occur, the sigma factor (a protein component of RNA polymerase) must be present in the holoenzyme. The sigma factor determines the specificity of the RNA polymerase for a promoter sequence. After RNA polymerase attachment, the sigma factor is released and gene transcription begins, generating an RNA molecule [11].
A typical bacterial promoter is located approximately 70 bp upstream from the starting point of gene transcription. A comparative analysis of several sigma 70 promoters (Gram-negative bacteria) allowed the identification of two consensus sequences: (A) one located at −10 bp (5′-TATAAT-3′) from the transcription start point; and (B) another located at −35 bp (5′-TTGACA-3′). These conserved regions define the affinity of the RNA polymerase complex for a promoter and the accuracy of gene expression. The aim of this paper was to study the promoter regions of Bacillus subtilis and to make a promoter data set available. This bacterium is considered a model organism in laboratory research due to its easy genetic manipulation [10]. The data obtained consist of 767 promoters separated into FASTA files, each one representing a promoter sequence of B. subtilis with a length of 80 nucleotides.
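As a rough illustration of the two consensus elements described above, the following Python sketch scans a sequence for a −35-like hexamer followed, after a spacer, by a −10-like hexamer. The function names, the mismatch tolerance, the six-base form of the −35 box (TTGACA), and the 15 to 19 bp spacer range are assumptions made for this example; they are not part of the data set or of the authors' method.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def find_promoter_like_sites(seq, box35="TTGACA", box10="TATAAT",
                             spacers=range(15, 20), max_mismatches=2):
    """Return (pos35, pos10, spacer, mismatches) tuples for candidate sites.

    Positions are 0-based offsets of each hexamer within seq. The motifs
    and spacer range are illustrative defaults, not values from the paper.
    """
    seq = seq.upper()
    hits = []
    for i in range(len(seq) - len(box35) + 1):
        mm35 = hamming(seq[i:i + len(box35)], box35)
        if mm35 > max_mismatches:
            continue
        for spacer in spacers:
            j = i + len(box35) + spacer
            window = seq[j:j + len(box10)]
            if len(window) < len(box10):
                break  # longer spacers would run past the end as well
            total = mm35 + hamming(window, box10)
            if total <= max_mismatches:
                hits.append((i, j, spacer, total))
    return hits
```

Calling `find_promoter_like_sites(sequence, max_mismatches=0)` reports only exact matches to both hexamers; raising `max_mismatches` admits the degenerate sites that are common in real promoters.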

Experimental design, materials, and methods
Initially, we collected the FASTA file containing the genome of B. subtilis from the NCBI (National Center for Biotechnology Information, http://www.ncbi.nlm.nih.gov) database and the promoters recognized by each sigma factor from the DBTBS (Database of Transcriptional Regulation in B. subtilis) database [17]. This included 15 sigma factors, which we divided into 2 domains: sigma 54 (SigL) and sigma 70 (SigA and others). They are presented in Table 1 with the following information: ORF (Open Reading Frame), description, and number of operons. SigA stands out due to its high number of operons and promoters identified (46.07%). Fig. 1 shows the proportion of operons for each sigma factor.

Table 1 Sigma factors of B. subtilis [16].

ORF | Description | Operons
… | … | 30
SigG | Control of transcription in the forespore at late stages of sporulation. | 61
SigH | RNA polymerase sigma-30. Non-essential sigma factor involved in expression of vegetative and early stationary-phase genes. | 24
SigI | Temperature-sensitive growth in a null mutant; transcription induced by heat shock in rich medium but not in minimal medium; reduced amount of GsiB protein in a sigI mutant under heat shock conditions. | 1
SigK | Formed by a site-specific recombination event that joins the previously separated spoIVCB and spoIIIC genes into a single cistron. | 59
SigM | Essential for growth and survival in high concentrations of salt; expression maximal during exponential growth and increased in high concentrations of salt; activity negatively regulated by YhdL and YhdK. | 7
SigW | ECF-type sigma factor that mediates the transcriptional response to cell wall stress. | 34
SigX | RNA polymerase SigX. | 15
SigY | RNA polymerase ECF (extracytoplasmic function)-type sigma factor. | 2
YlaC | RNA polymerase ECF (extracytoplasmic function)-type sigma factor. | 1

The data obtained from the DBTBS database contained the following information: (1) Operon; (2) Regulated Gene; (3) Absolute Position; (4) Location; and (5) Link Sequence. Due to space restrictions, we present only the data obtained for the SigL operons in Table 2. This table describes each operon by its transcribed genes, transcription start location, genome position (absolute position), binding sequence (red characters are the exact sequence and black characters are the start sequence), and experimental evidence (the published work that supports the data).
Concerning the experimental evidence for SigL, acoABCL was demonstrated by mapping the 5′ ends of the mRNA by primer extension for the acoA gene and by homology analysis [1]. levDEFG-sacC was demonstrated by mapping the 5′ ends of the mRNA by primer extension for the levD gene [10], by the use of a reporter gene, and by the disruption of the gene binding factor [7]. Finally, ptb-bcd-buk-lpdV-bkdAABB, rocABC, rocDEF and rocG were verified by mapping the 5′ ends of the mRNA by primer extension for the genes ptb [8], rocA [5], rocD [9] and rocG [2], respectively.
The FASTA genome file and the promoters obtained were used as input for a program written in Python [15] called searchPromoter.ph (source code in Appendix A). This program was developed to look for promoter regions in complete genomes. For each promoter, the program first looked at the absolute position in the genome FASTA file; if the promoter was not found there, it searched the genome for the sequence itself. This process was performed on all data obtained. After processing the data using this script, we obtained 767 promoter regions for B. subtilis, most of them related to SigA. All data obtained are available for download at http://bacpp.bioinfoucs.com/rafael/Sigmas.zip. Fig. 2 shows an example of how the promoter sequence of the acuABC operon, recognized by SigA, was selected from the B. subtilis genome.
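The two-stage lookup described above (absolute position first, sequence search as a fallback) can be sketched as follows. This is not the code from Appendix A: the function names, the 1-based coordinate convention, the single-strand search, and the choice to end the 80-nt window at the last base of the binding sequence are all assumptions made for illustration.

```python
def read_fasta(text):
    """Concatenate the sequence lines of a single-record FASTA string."""
    return "".join(line.strip() for line in text.splitlines()
                   if line and not line.startswith(">")).upper()


def locate_promoter(genome, binding_seq, absolute_pos, length=80):
    """Return a window of up to `length` nt ending at the binding sequence.

    Hypothetical helper: absolute_pos is assumed to be the 1-based
    coordinate of the first base of binding_seq on the forward strand.
    Returns None when the sequence cannot be found at all.
    """
    binding_seq = binding_seq.upper()
    start = absolute_pos - 1
    if genome[start:start + len(binding_seq)] != binding_seq:
        # Fallback: search the whole genome for the sequence itself.
        start = genome.find(binding_seq)
        if start == -1:
            return None
    end = start + len(binding_seq)
    return genome[max(0, end - length):end]
```

A fuller version would also have to handle promoters on the reverse strand and sequences that occur more than once in the genome, neither of which this sketch attempts.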
To validate the data, we developed a software package called BacSVM+ that uses the LibSVM library [6] to implement Support Vector Machines [3] for promoter prediction. It receives the promoters as input and returns the best combination of LibSVM parameters for predicting promoter regions in the bacterium used in the simulation. Its operation is based on searching for the combination of LibSVM parameters that maximizes prediction accuracy. For this, three steps must be followed during its execution: (A) data preparation; (B) support vector training; and (C) promoter prediction.
The lack of a user-friendly database could make the first step demanding for users. In this context, the major innovation of BacSVM+ is its data preparation step. If the user does not have the promoters, the program searches the whole genome of the respective bacterium for them, using the Python script described earlier. Based on the promoters gathered during this first step, it is possible to define the LibSVM parameters and simulate promoter classification.
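The source does not state how BacSVM+ turns a promoter sequence into numeric features. As one possible illustration of data preparation, the sketch below one-hot encodes each nucleotide and writes the result in LibSVM's sparse text format (`label index:value ...`, with 1-based indices in ascending order). The encoding scheme and the function name are assumptions, not the authors' implementation.

```python
BASES = "ACGT"


def to_libsvm_line(seq, label):
    """Encode a DNA sequence as one sparse line in LibSVM's text format.

    Assumed encoding: position p (0-based) and base b map to feature
    index 4 * p + BASES.index(b) + 1, with value 1.
    """
    features = []
    for pos, base in enumerate(seq.upper()):
        offset = BASES.find(base)
        if offset == -1:
            continue  # ambiguous bases such as N contribute no feature
        features.append(f"{4 * pos + offset + 1}:1")
    return f"{label:+d} " + " ".join(features)
```

Under this scheme an 80-nucleotide promoter yields at most 80 non-zero features out of 320, with label +1 for promoters and -1 for non-promoter examples.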
The LibSVM library allows a wide range of parameters to be set, as shown in Table 3. Among them, the most important are the cost (C) and gamma (G) parameters. C sets the penalty applied to training examples that are classified incorrectly; in other words, it is the price paid when points fall on the wrong side of the margin around the separating hyperplane. The G parameter, in turn, configures the kernel. For a Gaussian (RBF) kernel, it controls the width of the Gaussian: G is inversely related to the squared standard deviation, so larger values give a narrower kernel. BacSVM+ allows an extensive search over C and G by setting a range of possible values for each.
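To make the role of the two parameters concrete, the sketch below computes the RBF kernel in the form LibSVM uses, exp(−G·‖u − v‖²), and enumerates a base-2 logarithmic grid of (C, G) pairs. The grid bounds, the function names, and the `score` callback (which would typically return a cross-validated accuracy) are assumptions; the ranges actually explored by BacSVM+ are not given in the text.

```python
import itertools
import math


def rbf_kernel(u, v, gamma):
    """RBF kernel exp(-gamma * ||u - v||^2); gamma = 1 / (2 * sigma^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * sq_dist)


def parameter_grid(c_exponents, g_exponents):
    """Yield (C, gamma) pairs on a base-2 logarithmic grid."""
    for c_exp, g_exp in itertools.product(c_exponents, g_exponents):
        yield 2.0 ** c_exp, 2.0 ** g_exp


def grid_search(score, c_exponents, g_exponents):
    """Return (best_score, C, gamma) for a user-supplied score(C, gamma).

    `score` is a hypothetical callback, e.g. cross-validated accuracy.
    """
    return max((score(c, g), c, g)
               for c, g in parameter_grid(c_exponents, g_exponents))
```

An exhaustive grid is simple and easy to parallelize, at the cost of one training run per (C, G) pair; a coarse grid followed by a finer one around the best cell is a common way to limit that cost.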
Finally, in the last step, the user can predict promoter regions, and the results can be exported to a text file or a spreadsheet. The performance of each architecture was evaluated by its accuracy (A), specificity (S) and sensitivity (SN), computed with the standard formulas [18]. The results obtained in simulations with the 767 promoters from B. subtilis are consistent with related works found in the literature, thus validating the data gathered. The best combinations found used the NU-SVC and C-SVC algorithms with an RBF kernel, reaching prediction accuracies of 93.20% and 95.63%, respectively. The main innovation of BacSVM+ is the promoter search performed during the data preparation step, which allows users to run the software even if they do not have promoter and non-promoter examples for the simulation. Our results can be seen in Table 4.
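The three performance measures named above are conventionally defined from the counts of true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN): A = (TP + TN) / (TP + TN + FP + FN), S = TN / (TN + FP) and SN = TP / (TP + FN). The sketch below computes them under these standard definitions, which are assumed to match those of [18]; the function name is hypothetical.

```python
def classification_metrics(tp, tn, fp, fn):
    """Return (accuracy, specificity, sensitivity) from confusion counts.

    Standard definitions, assumed to match the formulas cited in the text.
    """
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    specificity = tn / (tn + fp)   # fraction of non-promoters rejected
    sensitivity = tp / (tp + fn)   # fraction of promoters recovered
    return accuracy, specificity, sensitivity
```

Reporting all three values matters when the promoter and non-promoter classes differ in size, since accuracy alone can hide a poor sensitivity or specificity.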
Related works that predict B. subtilis promoter regions with Support Vector Machines were found in the literature. Monteiro et al. [12] did not develop their own software. They used the WEKA software, which, unlike BacSVM+, is implemented in the Python and Java languages. In contrast to the 767 promoters used to validate BacSVM+, 112 promoters of B. subtilis were used in their research. The accuracy they obtained, 76%, was lower than the accuracy obtained with BacSVM+. Another group developed PePPER as a web-server-based promoter prediction tool (it does not require installation and can be accessed over the Internet), but they did not show results [4]. Finally, TSS SVM [11] analyzes the structural profiles of promoter regions, but it does not focus specifically on the problem of promoter prediction. The authors state that promoter regions are less stable and more rigid than the rest of the genome, but that this is less visible in Gram-positive bacteria such as B. subtilis.

Acknowledgments
This work was supported by grants from the National Council for Scientific and Technological Development (CNPq). The authors wish to thank the University of Caxias do Sul and the Federal Institute of Education, Science and Technology for their support of this research.