Comparative genomics and evolution of transcriptional regulons in Proteobacteria

Comparative genomics approaches are broadly used for the analysis of transcriptional regulation in bacterial genomes. In this work, we identified binding sites and reconstructed regulons for 33 orthologous groups of transcription factors (TFs) in 196 reference genomes from 21 taxonomic groups of Proteobacteria. Overall, we predicted over 10 600 TF binding sites and identified more than 15 600 target genes for 1896 TFs constituting the studied orthologous groups of regulators. These include a set of orthologues for 21 metabolism-associated TFs from Escherichia coli and/or Shewanella that are conserved in five or more taxonomic groups and several additional TFs that represent non-orthologous substitutions of the metabolic regulators in some lineages of Proteobacteria. By comparing the gene contents of the reconstructed regulons, we identified the core, taxonomy-specific and genome-specific TF regulon members and classified them by their metabolic functions. Detailed analysis of ArgR, TyrR, TrpR, HutC, HypR and other amino-acid-specific regulons demonstrated remarkable differences in the regulatory strategies used by various lineages of Proteobacteria. The obtained genomic collection of in silico reconstructed TF regulons contains a large number of new regulatory interactions that await future experimental validation. The collection provides a framework for future evolutionary studies of transcriptional regulatory networks in Bacteria. It can also be used for functional annotation of putative metabolic transporters and enzymes that are abundant in the reconstructed regulons.

avoid skews in the consistency check approach and to simplify the simultaneous analysis in the RegPredict web server. In doing so, we preferentially selected the best-studied representative genome in each set of closely related genomes. Next, we searched for orthologous TFs in the selected genomes using the bidirectional best hits approach and the protein BLAST server at NCBI (Altschul, et al. 1997). For regulon reconstruction in each group of genomes possessing TF orthologs, we used the standard comparative genomics approach (Rodionov 2007), which consists of the following steps: 1. Obtain a training set of potential TFBSs; 2. Whole-genome search for additional TFBSs and regulon members; 4. PWM refinement and continuation from step 2. To collect training sets, we used two strategies. (i) For previously studied regulons, we collected the upstream regions of genes known to be regulated, giving preference to more precise information on the location of TFBSs (for example, from electrophoretic mobility shift assays or DNase footprinting assays). (ii) For novel TF regulons, we used genomic context analysis, in which regulation of neighboring genes was predicted from their conserved co-localization in one locus, mapped onto the phylogenetic tree of the TF. Another approach is functional analysis, based on the assumption that genes from one metabolic pathway or one process should be regulated together. Following this approach, we took the upstream regions of genes from one process; the TF was associated with their regulation by conserved co-localization of the TF gene with genes from this pathway. The collected upstream regions were used to identify a common DNA motif using the Discover Profiles tool in the RegPredict platform (Novichkov et al., 2010). We searched for DNA motifs with either palindromic or tandem-repeat symmetry. Sequences of the identified DNA motif sites were used to build a PWM. The constructed PWMs were then used to search for additional potential TFBSs in the upstream regions of all genes in the genomes using the RegPredict server. Typically, we searched the regions from 400 nt upstream to 50 nt downstream of the translational start of each gene. The threshold for the site search procedure was typically set 10% below the lowest site score in the training set. The whole-genome searches in RegPredict result in the construction of a set of CRONs (Clusters of co-Regulated Orthologous operoNs).
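The PWM construction and thresholded site search described above can be illustrated with a minimal sketch. It is not the RegPredict implementation: the log-odds scoring with a uniform background, the pseudocount, and all function names (`build_pwm`, `score_site`, `search_threshold`, `scan_upstream`) are assumptions made for illustration. Only the 10% margin below the lowest training-site score and the search window starting 400 nt upstream of the translational start come from the text.

```python
import math

BASES = "ACGT"

def build_pwm(sites, pseudocount=0.5, background=0.25):
    """Build a log-odds PWM from aligned, equal-length training sites.

    The pseudocount and uniform background are illustrative assumptions.
    """
    length = len(sites[0])
    if any(len(s) != length for s in sites):
        raise ValueError("all training sites must have the same length")
    total = len(sites) + pseudocount * len(BASES)
    pwm = []
    for pos in range(length):
        column = [s[pos].upper() for s in sites]
        pwm.append({
            b: math.log2(((column.count(b) + pseudocount) / total) / background)
            for b in BASES
        })
    return pwm

def score_site(pwm, site):
    """Sum per-position weights; unknown bases get the column minimum."""
    return sum(col.get(base, min(col.values()))
               for col, base in zip(pwm, site.upper()))

def search_threshold(pwm, training_sites, margin=0.10):
    """Threshold set 10% below the lowest training-site score."""
    lowest = min(score_site(pwm, s) for s in training_sites)
    return lowest - margin * abs(lowest)

def scan_upstream(pwm, region, threshold, region_start=-400):
    """Scan one strand of a region that begins at `region_start`
    relative to the translational start; return (position, site, score)."""
    width = len(pwm)
    hits = []
    for i in range(len(region) - width + 1):
        window = region[i:i + width]
        score = score_site(pwm, window)
        if score >= threshold:
            hits.append((region_start + i, window, score))
    return hits
```

For brevity the sketch scans a single strand; a real search would also score the reverse complement of each window.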
Each CRON in RegPredict was built by the following algorithm: 1) the PWM was used to find potential TFBSs scoring above the threshold; 2) an operon was predicted by taking the gene with a potential TFBS as the first gene of the operon and extending the operon to all genes with the same direction and an intergenic distance of less than 200 nt; 3) orthologs and paralogs of each gene in this operon were identified based on Orthologous Groups in the MicrobesOnline database; 4) steps 2 and 3 were repeated until convergence. Automatic construction of CRONs and manual curation of the obtained CRONs in the RegPredict server allowed us to filter out false-positive site predictions using the consistency check approach, which is based on the assumption that true sites are conserved in evolution. Cases of operon gene content rearrangement were also taken into consideration in the course of CRON analysis and curation. In the next step, the identified true-positive TFBSs were added to refine the PWM, and the genomic site searches were repeated.
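The operon prediction rule in step 2 can be sketched as follows. This is an illustrative reading of the rule, not RegPredict code: the `Gene` record, the `extend_operon` function and the coordinate convention (gap computed as the next gene's start minus the previous gene's end) are hypothetical. The sketch assumes genes are sorted by start coordinate and walks from the gene carrying the candidate TFBS in its direction of transcription.

```python
from dataclasses import dataclass

@dataclass
class Gene:
    name: str
    start: int   # leftmost coordinate
    end: int     # rightmost coordinate
    strand: str  # "+" or "-"

def extend_operon(genes, first_index, max_gap=200):
    """Extend an operon from the gene at `first_index`.

    `genes` must be sorted by start coordinate. Genes are added while
    they share the first gene's strand and the intergenic distance to
    the previously added gene is below `max_gap` nucleotides.
    """
    first = genes[first_index]
    step = 1 if first.strand == "+" else -1  # follow transcription direction
    operon = [first]
    i = first_index + step
    while 0 <= i < len(genes):
        prev, cur = operon[-1], genes[i]
        if cur.strand != first.strand:
            break
        gap = cur.start - prev.end if step == 1 else prev.start - cur.end
        if gap >= max_gap:
            break
        operon.append(cur)
        i += step
    return operon
```

Overlapping genes give a negative gap and are therefore kept in the operon, which seems the natural reading of a pure distance cutoff.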
At the final step of the manual regulon annotation, gene functions were assigned using the existing gene annotations in the GenBank and SEED databases (Overbeek, et al. 2005), annotations of homologous proteins in the SwissProt/UniProt database (UniProt 2015) and analysis of Pfam domains (Finn, et al. 2016). All reconstructed regulons were deposited in the latest release of the RegPrecise database (http://regprecise.lbl.gov) (Novichkov, et al. 2013).

[Figure placeholder. Recoverable labels: "Number of taxonomic groups with regulatory interaction"; "Conservation groups".]