An experimentally supported model of the Bacillus subtilis global transcriptional regulatory network

Abstract Organisms from all domains of life use gene regulation networks to control cell growth, identity, function, and responses to environmental challenges. Although accurate global regulatory models would provide critical evolutionary and functional insights, they remain incomplete, even for the best studied organisms. Efforts to build comprehensive networks are confounded by challenges including network scale, degree of connectivity, complexity of organism–environment interactions, and difficulty of estimating the activity of regulatory factors. Taking advantage of the large number of known regulatory interactions in Bacillus subtilis and two transcriptomics datasets (including one with 38 separate experiments collected specifically for this study), we use a new combination of network component analysis and model selection to simultaneously estimate transcription factor activities and learn a substantially expanded transcriptional regulatory network for this bacterium. In total, we predict 2,258 novel regulatory interactions and recall 74% of the previously known interactions. We obtained experimental support for 391 (out of 635 evaluated) novel regulatory edges (62% accuracy), thus significantly increasing our understanding of various cell processes, such as spore formation.


RNA isolation, cDNA synthesis, labeling and hybridization to microarrays
Samples were immediately mixed with an equal volume of methanol (-20˚C) and centrifuged to pellet cells. Cell pellets were stored at -80˚C. RNA was isolated using a hot acid-phenol isolation procedure modified from Fawcett et al (2000). In this method, cell pellets are resuspended in LETS Buffer (10 mM Tris-HCl pH 8.0, 50 mM LiCl, 10 mM EDTA, 1% SDS) and then vortexed with acid-washed beads and hot (75˚C) acid-phenol, followed by addition of chloroform. The aqueous phase is next treated with hot phenol-chloroform and the subsequent aqueous phase is added to an equal volume of isopropanol to precipitate the RNA. The sample is centrifuged, washed, and resuspended in water. The sample is then treated with Qiagen's RNase-Free DNase in solution and the RNA clean-up protocol of Qiagen's RNeasy® kit was used. The quality of the RNA was checked by visualizing the integrity of the 23S and 16S rRNA bands on an agarose gel. The RNA was quantified using the nanodrop.
Labeled cDNA was generated from RNA using Agilent's Fairplay® III Microarray Labeling Kit. The Agilent Two-Color Microarray-Based Prokaryotic Analysis (FairPlay III Labeling) Protocol was used with the following changes: between 6-10 μg total RNA was used, the reverse transcription reaction was performed at 42˚C for 2 hours, and the NHS-ester dye-coupling reaction to amino allyl dUTP (using GE Healthcare Cy™3 and Cy™5 monofunctional reactive dyes) was incubated for 90 minutes. Hybridizations were incubated at 65˚C for 17 hours. Arrays were scanned by Agilent Technologies DNA Microarray Scanner with Surescan High-Resolution Technology.

Microarray design
The NCBI and EMBL annotations of the B. subtilis 168 genome were downloaded on July 28th 2010 and all annotated protein coding genes and non-coding RNAs were combined with small RNAs (sRNAs) identified in the tiling array and RNAseq transcriptional profiling experiments carried out by Rasmussen et al (2009) andIrnov et al (2010). All sequences were then combined using CD-HIT (Huang et al, 2010). CD-HIT was used to identify proteins that were 99% sequence identical over 99% of their length (primarily accounting for the large overlap between the NCBI and EMBL annotations). Agilent's eArray software (accessed via the Agilent website) was then used to design an array with three 60-mer probes (features) per protein-coding gene and 2 or 3 probes per RNA gene. We also incorporated 536 Agilent control features that are part of the Agilent platform. In total, the array has 15744 features for a total of 4,231 protein-coding genes and 659 putative non-coding RNAs (12 rRNAs, 49 tRNAs and 598 putative or previously characterized small RNAs) and can be ordered from Agilent (array name: b.subt-final-sense-3probes-july29).

Fluorescence microscopy
Fluorescence microscopy was performed as described before (Kim et al, 2006;Wang et al, 2009;McKenney et al, 2010). Briefly, 1 ml aliquots of the sporulating cultures were transferred into microcentrifuge tubes and spun down at 8000 rpm for 2 min in a microcentrifuge. Pellets were resuspended in 100 µl PBS supplemented with the membrane dye FM4-64 (Invitrogen) at 1.5 µg ml −1 final concentration. Two microliters of the concentrated sample were placed on a microscope slide and covered by a poly-L-lysine-treated coverslip for analysis. Images were taken using a Nikon 90i motorized fluorescent microscope equipped with a Roper 1 K monochrome digital camera and driven by the NIS-Elements AR 3.0 software. A 100× Plan Fluor 1.3NA objective (Nikon) was used for all image collection. Images were processed with ImageJ (http://rsbweb.nih.gov/ij/) or Photoshop for minor adjustments of brightness, contrast and color balance.

Spore purification procedure
Bacterial cultures (20 mL) were incubated in DSM for 48 hours and centrifuged at 5,000RPM for 10 minutes at room temperature. The pellet is washed three times with 20mL cold deionized water. In a separate Eppendorf tube, 200uL 50% renografin solution is added. The pellets are re-suspended in 400uL 20% renografin solution and transferred into the Eppendorf tube, layering carefully to avoid mixing of the two solutions. The Eppendorf tube is then centrifuged at 13,200RPM for 10 minutes at room temperature. The renografin solutions with different densities help filter out cellular debris, leaving purified spores pelleted at the bottom in the 50% renografin layer. Over time, the renografin may induce germination of the spores, so the pellet is washed three times with cold deionized water in order to remove all excess renografin. The purified spores are stored in 1mL PBS at -20°C. To test the adhesion of the spores to glass, purified spores are re-suspended in water to a final OD600 of 15. One hundred microliters of each sample are then transferred into Pyrex glass tubes and vortexed gently for 1 minute, giving the spinning spores sufficient time to adhere to the walls of the tube. The remaining solution is then aspirated off, leaving the residue behind to dry for ten minutes before image acquisition with a digital camera.

Computational search for DNA binding motifs
We used the MEME package version 4.9.1 (Bailey et al, 2009) to identify putative DNA sequence motifs characteristic of TF binding. For TFs with KO data, we manually defined the differentially expressed operons, i.e.
when adjacent genes oriented in the same direction were differentially transcribed we grouped them in an operon, the most upstream of the differentially expressed genes was selected as the first gene in the operon. For TFs without KO data, we used the operon predictions from the MicrobesOnline database (Dehal et al, 2010).
Subsequently, we extracted the DNA sequence from 0 to 200 bp upstream of the translational start site of the first gene in each operon predicted as target. MEME was run on each group of sequences using the zoops mode (zero or one occurrence per sequence), restricting the number of output motifs to 10 and using all other parameters as default. Next, we compared the retrieved sequence motifs with those described in previous reports (when available). For TFs for which known motifs were not fully recovered, we tried several combinations of minimum and maximum width parameters based on previously associated motifs (this was required for  G ,  H and  K ). An operon was considered to be a "known" target if any of its members was already part of the corresponding regulon in the prior network.

Gold standard (GS) network
In total, 3040 experimentally validated transcriptional interactions of B. subtilis, involving 1874 genes, were obtained from a slightly larger set downloaded from the SubtiWiki database (Michna et al, 2014) on March 26, 2013, and from a list of  A -controlled genes obtained from Helmann (1995). Genes filtered out during microarray processing due to a lack of expression variance were excluded from the GS. The GS network contains 1746 positive and 1294 negative interactions (Dataset EV2). There are 46 TFs with more than 10 targets, 32 TFs with more than 5 but less than 11 targets, and 75 TFs with less than 6 targets.

Inferring regulatory relationships using Bayesian Best Subset Regression
We now describe the BBSR method, an inference method that computes the regression models for a given gene i corresponding to the inclusion and exclusion of each TF that has a known regulatory effect on i, and the ten TFs with highest time-lagged CLR. Prior knowledge is incorporated by using informative priors for the regression parameters, and sparsity is enforced by a model selection step based on the Bayesian Information Criterion (BIC).

Bayesian Regression With Informative Prior
Here we introduce the linear regression we use during the model building step of the algorithm. We assume the prediction error ϵ i =y i −Â 'β i to be independent and identically distributed with mean 0 and variance σ 2 . The response variable of gene i is denoted as yi (the i th row of X), the design variables of TFs (TF activities, including time shift if applicable) as Â and the regression solution as βi. For clarity, we will omit the index i for the remainder of this section. We assume that the target gene response is distributed according to a multivariate normal ( y|β,σ 2 ,Â )∝ N n (Â 'β,σ 2 I ) with the predicted response as mean, and a variance co-variance matrix that has the error variance σ 2 on its diagonal and is 0 otherwise. In this formulation, n is the number of observations (experiments). This can be solved by a Bayesian regression where we can incorporate existing knowledge by tuning the prior on β.
We use a modification of Zellner's g Prior (Zellner, 1983) to include subjective information in our Bayesian regression problem. In the original formulation, the prior distribution of β has the following form ρ (β|σ 2 )∝N (β 0 ,g (ÂÂ ' ) −1 σ 2 ), i.e. a distribution proportional to a multivariate normal with an initial guess β 0 as mean and a data-dependent covariance matrix that is scaled by a user chosen factor of g є (0, ∞). The prior distribution of σ 2 is the same as is typically used with the non-informative prior, ρ (σ 2 )∝ 1 σ 2 . The choice of a large value for g will lead to results centered around the ordinary least squares solution, and the error variance will be the lowest. Values of g close to 0 on the other hand will lead to solutions that are centered around β 0 with higher error variance.
The joint posterior distribution has the functional form ρ (β,σ 2 | y )=ρ(β|y,σ 2 ) ρ (σ 2 | y ), and the marginal posterior distributions are where IG is the Inverse Gamma distribution with shape and scale parameter, and SSR is the sum of squares of the residuals of the ordinary least squares solution β ols .
With this set-up, we can propose a prior guess β 0 of the vector of regression coefficients, and encode our belief in this guess with g. To allow for different levels of confidence in the different elements of β 0 we extend the original formulation of the g prior to use a vector ḡ with one entry per predictor. The scale parameter of the Inverse Gamma distribution of the marginal posterior distribution of σ 2 then becomes where G is a square diagonal matrix whose diagonal entries starting in the upper left corner are √ 1 g+1 and all remaining entries are 0.
In practice, we choose β 0 to be a vector with all entries having the value 0. This reflects our prior belief that the regulatory network is generally quite sparse. We set the vector ḡ to values of g for those predictors that we have additional knowledge for and believe that they regulate gene i, and to 1/g for the other predictors. A value of g = 1 treats all predictors equally and we refer to it as 'no priors', whereas g > 1 allows the predictors with priors to explain for more of the variance of the response.

Model selection
We use the BIC to select the final model from the 2 p possible regression models for a gene i. For a given model m, the BIC is defined as where n is the number of observations and k the number of predictors. To be more robust, we avoid using a point estimator for σ 2 directly, but use the expected value of BICm based on the posterior distribution of σ 2 where shape and scale parameterize the marginal posterior distribution of σ 2 as stated above. As a final step, the predictors of the model with the lowest E[BIC] are selected as the TFs regulating gene i.
If p, the total number of potential predictors for a given gene, is large (> 10) it becomes infeasible to compute all 2 p possible regression models during the model selection step. To further reduce the number of predictors, we look at a subset of all possible models, and employ an averaging method to discover the 10 most promising predictors. We first build all models containing one or two predictors, and compute the expected BIC for each one. For every predictor we compute the average expected BIC of all models containing that particular predictor.
These averages allow us to rank the predictors, and to reduce the set to the 10 best predictors as defined by the BIC in this small subspace of all models.

Comparing TFA-BBSR to other inference methods
To better assess the performance of our method with respect to other state of the art network inference algorithms, we created networks using three additional methods: 1) ARACNE (Margolin et al, 2006), 2) CLR (Faith et al, 2007), 3) GENIE3 (Huynh-Thu et al, 2010). We implemented CLR using the mutual information estimates as provided by the parmigene R package (Sales & Romualdi, 2011). We obtained GENIE3 from http://homepages.inf.ed.ac.uk/vhuynht/software.html and used the R implementation with default parameters. All methods were tested using both data sets, and we modified CLR and GENIE3 to be able to supply design and response variables while keeping the core algorithms the same. This way, all three methods are given identical input (as defined by the TFA estimation and Inferelator core model) and differences due to the use of time-series information are eliminated. Additionally, we obtained ARACNE from http://wiki.c2b2.columbia.edu/califanolab/index.php/Software/ARACNE (aracne2 executable) and ran it using 66 combinations of different MI p-value and DPI tolerance parameters. P-value parameters where in the set of 1e -10 , 1e -9 , 1e -8 , …, 1, while DPI tolerance parameters where in the set 0, 0.05, 0.1, 0.15, 0.2, 1. We observed the highest performance at p-value parameter 1e -10 and DPI tolerance parameter 0 for which we performed a final run doing 50 bootstraps. As the performance (measured by AUPR) was only about 50% (0.076) of that of the other tested methods, we did not pursue ARACNE any further.

Network visualization
The combined network was visualized using Cytoscape 2.8 (Smoot et al, 2011). The multiColoredNodes plug-in (Warsow et al, 2010) was used to visualize information about the proportion of recovered priors and novel targets for each regulon.

Functional enrichment analyses
Gene annotations were retrieved from SubtiWiki and available literature reports. Genes associated with specific cellular processes were identified using gene categories defined in SubtiWiki (see above).

Differential gene transcription analyses
We compared gene transcription profiles of wild-type (WT) and the respective mutant strains (KO) using Bayesian t-tests with the Cyber-T tool (Baldi & Long, 2001). The main challenge we faced with the Bayesian ttests was the low number of differentially transcribed genes detected for several TFs after correcting for multiple hypothesis testing. This issue was also observed when other tools such as Limma (Smyth, 2004) were used. To evaluate if the small number of detected differentially transcribed genes was due to the lack of change in the transcription profile of expected genes (based on the regulon's information derived from the GS network), we computed the Precision Recall plot for each WT and KO comparison. We ranked the genes using the raw pvalues obtained from the corresponding Bayesian t-test (in increasing order) and computed the Area Under Precision Recall (AUPR) curve using the TF targets in the GS network as reference. To get the expected distribution of AUPRs for the case where the arrays were uninformative with respect to the expected set of differentially transcribed genes, we also computed the AUPR using multiple random rankings of genes. We used 0.01 as the threshold p-value for considering a gene differentially transcribed.

Network validation using gene counts
To test each TF-gene interaction we used the differential transcription analyses described above. Target genes were considered to be supported when differentially transcribed (for steady state experiments) or in at least one time point for time series. All tested networks consisted of the top 4516 predicted interactions of the respective methods.

Network validation experiments
Transcriptional profiling experiments to characterize the regulons of alternative   factors In this section, we present a detailed analysis of the network predictions for the alternative  factors and their validation by transcriptional profiling. From the data presented in Table I and Dataset EV6, three groups emerge based on accuracy of predictions. In the first group (high accuracy), composed of  E ,  F ,  G and  W , the accuracy for novel predictions was 0.8 or higher, even though compartmentalization of  factor activity during sporulation (which applies to  F ,  G ,  E and  K ) is a confounding factor. For instance,  F and  E are both active during the early stages of sporulation, but  F activity is limited to the forespore, while  E controls mother cell gene suggesting that we might be underestimating the number of true predictions, but additional experiments would be necessary to fully validate these putative new targets. It is important to note that many of the targets of  K are also dependent on GerE ( K and GerE form a feed-forward loop). When the contribution of GerE is taken into account, the accuracy of predictions for  K is 0.94.
The last group (very low accuracy) consists of just one  factor,  M (0.04). This is not caused by an inability to estimate activity, because 31 transcription units controlled by  M are present in the GS and 84% of these operons were recovered in the model. We believe that the group of putative  M targets contributed by the second dataset is made of co-regulated genes, but that the inference approach did not assign the correct TF(s) to it.

Other regulators
For other global regulators, accuracy for novel predictions was between 0.3 and 0.65 (Table I). One exception was repression by Spo0A, where the low accuracy (0.17) was likely a consequence of a noisy set of priors. The Spo0A genes in the GS had originally been identified as targets using a ChIP-on-chip approach (Molle et al, 2003) that lacked the precision of current approaches. While it is possible that our model could be improved by filtering priors for every TF; we do think that, overall, only very few predictions would change. Novel targets of ScoC were also predicted at low accuracy (2 validated operons out of 9 novel predictions); however, ScoC is only one of many regulators of the transition to stationary phase. Several of the new predictions for ScoC were shown to be differentially expressed in experiments with codY or spo0A deletion mutants (which are also involved in this transition).

External validation
In for CodY in the upstream sequence of predicted targets (Appendix Figure S5) and our own transcriptional profiling experiments with a codY gene deletion mutant ( Table I and Dataset EV6). Twelve out of 17 novel targets predicted for CodY were supported by these approaches. In general, transcriptional profiling experiments, bioinformatics analyses and external sources suggest an accuracy of at least 0.5 for novel predictions. Accuracy is expected to be greater for global regulators than local regulators, due to the fact that the number of priors is higher for these TFs and their activity can be more accurately predicted.

DNA motifs
In addition to CodY targets, we obtained conserved motifs matching previously known consensus sequences for  B ,  D ,  E ,  G ,  H ,  K ,  L ,  W and CcpA (Appendix Figure S5). The proportion of recovered and novel targets exhibiting putative binding sites does not significantly differ (although it is usually slightly lower for novel targets).
The highest proportion of novel targets containing a match to the consensus is obtained for  E (0.98). Known motifs were also identified for CymR, Fur, LexA, Zur and IolR (data not shown). Overall, when DNA binding motifs were identified, 68% of the novel targets were supported; in total, this represents 209 novel TF-operons interactions.

Competence (ComK regulon)
From the inferred TRN, we obtained a list of putative targets of ComK. Subsequently, we obtained experimental support for these interations by transcriptional profiling of a comK mutant strain. We grouped and color-coded these genes according to their functional annotation (Appendix Figure S6). As expected, the main categories were competence, DNA repair and recombination. New targets of ComK include coiA (yjbF), a gene whose role in competence had been previously recognized (Kramer et al, 2007), sacY which is involved in sucrose metabolism (Crutz & Steinmetz, 1992) and hxlR, which encodes a TF regulating the response to formaldehyde induced-stress (Yasueda et al, 1999). Another group is formed by genes encoding transporters (i.e. the ywfM and ycKBA operons, and the yvrPON operon, a previously known target). Thus, cellular functions of the ComK regulon are not restricted to competence, but impact a variety of processes, including a new connection with the oxidative stress response mediated by HxlR.

Metabolism (CodY regulon)
In Appendix Figure S7, we present a functional analysis of the CodY regulon. Importantly, our predictions are in good agreement with two recent publications reporting the global identification of CodY targets (Belitsky & Sonenshein, 2013;Brinsmade et al, 2014). We observed that the following processes are repressed by CodY: i) biosynthesis of branched chain amino acids (BCAA); ii) exploitation of alternative sources of nitrogen, including uptake and utilization of peptides and amino acids, regulation of the urea cycle, and protein degradation (2 new targets, ispA and vpr, encoding two serine proteases, contribute to this process). Recently, vpr was described as a CodY target (Barbieri et al, 2015); iii) carbon metabolism, with target genes encoding enzymes of the tricarboxylic acid cycle (citB and acsA); iv) production of antibacterial compounds, including bacG (a previously known target involved in the synthesis of bacilysin, whose regulation is also dependent on ScoC) and the pks operon (a novel target involved in the synthesis of polyketides); v) initiation of sporulation ( rapA-phrA, rapE-phrE and kinB), by controlling the phosphorylation of Spo0F, an intermediate in the pathway leading to activation of Spo0A; vi) membrane fluidity (yuaF-floT-yuaI operon); vii) 13 novel targets of unknown function (yknV, amhX, yjbA, sndC, yfmB, ykwB, yjcL, yjcK, yocS, yvdA, ywqJ, yqfZ and yhjB in an operon with a previously known target, yhjC). Thus, the addition of several genes of unknown function to the regulons of well-known regulators like ComK and CodY shows that even the best characterized cell processes contain previously unrecognized components and new connections to other cell processes.

Sporulation
In Appendix Figure S8A, we analyzed the expression of yetF, which encodes a putative membrane protein of unknown function. In our model, yetF is predicted to be a target of  F . Just upstream of yetF, lplD was previously recognized as a  E target (Eichenberger et al, 2004). Transcriptional profiling experiments with sigF and sigE mutant strains confirm that lplD is differentially expressed in a sigF mutant, while its level remains almost unchanged in a sigE deletion mutant. We conclude that yetF is primarily under  F control. Inspection of the sequence directly upstream of yetF reveals a near perfect match to the consensus sequence for  F promoters. In wild type cells, YetF-GFP localization is similar to that of other forespore membrane proteins, i.e. a ring of green fluorescence overlapping with the location of the forespore membranes. As expected, both sigF and sigE mutants exhibited the classic disporic phenotype characteristic of sigE inactivation [with polar septa formed at both ends of the sporulating cell (Pogliano et al, 1999), but a green fluorescent signal was only detected in a sigE mutant, demonstrating that yetF expression does not strictly require the  E -dependent promoter. Control experiments showed that the pattern of YetF-GFP localization is unaffected by inactivation of  G or  K .
In Appendix Figure S8B, we validate the prediction that the ykoS-ykoT operon is a novel target of  G . Both genes are predicted to encode membrane proteins, but, in addition, YkoT is homologous to a family of glycosyltransferases. Downstream of ykoT, we find a previously characterized sporulation operon, ykoV-ligD, involved in DNA repair via a non-homologous end joining mechanism (Wang et al, 2006). This second operon is known to be the target of a coherent FFL formed by  G and SpoVT. Transcriptional profiling experiments confirm that, as predicted, both operons are regulated by  G ; however, they also show that SpoVT is an activator of ykoV-ligD and a repressor of ykoST (therefore, ykoST operon is the target of an incoherent FFL formed by  G and SpoVT). Thus, although both regulators were correctly identified for ykoV (a gene included in the GS), the prediction for ykoST was incomplete (because it did not include SpoVT). As observed for many other targets of FFLs, the primary regulator ( G ) was accurately predicted, but the contribution of the secondary regulator (SpoVT) was missed. Importantly, we identified a putative binding site for  G with a good match to the consensus upstream of ykoS. In addition, as expected, a GFP fusion to YkoT showed the typical expression pattern of a forespore membrane protein. Furthermore, no expression was detected in a sigG mutant, while subcellular localization was not affected in either the spoVT or sigK mutants.
In Appendix Figure S8C, we characterize the expression and sub-cellular localization during sporulation of YkzQ-GFP, a putative novel target of  K . The gene immediately upstream of ykzQ (ykvP) is a previously known target of  K and GerE (Kodama et al, 2000). YkzQ is a small protein of 75 a.a., which was missed in the original annotation of the B. subtilis genome sequence (Kunst et al, 1997). It contains a LysM domain known to promote binding to peptidoglycan (Buist et al, 2008) and frequently found in spore coat proteins. We confirmed the prediction that expression of ykzQ was dependent on  K (presumably from the promoter located upstream of ykvP). YkzQ-GFP localizes to the spore coat, more specifically the outer coat, since localization was impaired in a cotE mutant.  This table reports the effect that the number of priors per TF has on the confidence scores of predicted interactions (for recovered priors and novel interactions) and the average number of novel targets per category.

Number of targets
Knock−out support Yes No  Figure S5. Presence of putative binding sites in predicted target operons. Putative binding sites found in he sequence upstream of operons predicted as targets for the corresponding TFs. Identified sequences agree with reviously recognized motifs. The number of targets displaying the motif is shown for each regulon. An operon was onsidered to be "known target" if at least one of its members was annotated as target in the GS.    Figure S7. Functional analysis of the CodY regulon. Predicted targets of CodY. Novel interactions are ndicated by dashed lines. Interactions recovered from the GS are indicated by solid lines. Dark brown lines indicate nteractions supported by transcriptional profiles, motif searches and ChIP-seq data (Belitsky & Sonenshein, 2013). ight brown lines indicate interactions supported by two out of the three supporting data types. Seagreen lines ndicate interactions supported by one out of the three supporting data types. Operons with genes assigned to more han one functional category were subdivided based on the functional annotation. Functional categories are coloroded. Diamonds indicate transcription factors.