Data from computational analysis of the peptide linkers in the MocR bacterial transcriptional regulators

Detailed data from statistical analyses of the structural properties of the inter-domain linker peptides of the bacterial regulators of the MocR family are herein reported. MocR regulators are a recently discovered subfamily of bacterial regulators possessing an N-terminal domain, 60 residues long on average, folded as a winged helix-turn-helix architecture responsible for DNA recognition and binding, and a large C-terminal domain (350 residues on average) that belongs to the fold type-I pyridoxal 5′-phosphate (PLP) dependent enzymes such as aspartate aminotransferase. The data show the distribution of several structural characteristics of the linkers taken from bacterial species from five different phyla, namely Actinobacteria, Alpha-, Beta-, Gammaproteobacteria and Firmicutes. Interpretation and discussion of the reported data refer to the article “Structural properties of the linkers connecting the N- and C- terminal domains in the MocR bacterial transcriptional regulators” (T. Milano, S. Angelaccio, A. Tramonti, M. L. Di Salvo, R. Contestabile, S. Pascarella, 2016) [1].

© 2016 The Authors. Published by Elsevier Inc. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).

Subject area
Biology
More specific subject area
Structural properties of linkers in the bacterial transcriptional regulators
Type of data Linker sequences were extracted from multiple sequence alignments of MocR regulators. Computational analysis defined the residue and residue-dyad propensities and the distribution of physicochemical properties in the linker sequences.

UniProt, RefSeq
Data accessibility Data are within this article. Linker sequence sets are available at https://sites.google.com/a/uniroma1.it/pascarellalab/home/resources

Value of the data
The data describe the structural properties of the peptide linkers connecting the N- and C-terminal domains in the MocR bacterial regulators.
The data provide researchers with a framework to select specific MocR regulators for experimental characterization.
The data support the design of experiments investigating the properties of specific MocR regulators: for example, site-directed mutagenesis, deletions or insertions in linker regions.
The data can help the interpretation of experimental data obtained from MocR studies.
The data provide a framework to derive rules for the de novo design of peptide linkers with desired properties.

Data
Results derived from computational analysis of the sequences of the inter-domain peptide linker connecting the N-terminal and C-terminal domains of the bacterial transcriptional regulators of the MocR subfamily are herein reported. Data are shown as tables describing linker statistics, such as residue and dyad composition propensities and predicted secondary structure frequency, and as box plots showing the distribution of several structural properties. Moreover, plots of the length distributions of linkers from two specific MocR subgroups, namely PdxR and GabR, are also reported.

Experimental design, materials and methods
Data were generated from the analysis of MocR sequences taken from the most populated phyla: Actinobacteria, Firmicutes, Alpha-, Beta- and Gammaproteobacteria. Sequences of the MocR regulators in each phylum were retrieved from the UniProt data bank [2], accessed in October 2015, using RPSBLAST of the BLAST suite [3] and the CDD data bank [4]. Protein sequences in which RPSBLAST identified both the wHTH (winged helix-turn-helix) and AAT (aspartate aminotransferase) domains were considered genuine MocR regulators. Before further processing, the retrieved sequences were filtered at 75% sequence identity with the program CD-HIT [5]. Multiple sequence alignments were calculated with the program ClustalO [6] and processed with the software Jalview [7]. Linker sequences were manually extracted from the multiple sequence alignments according to the wHTH and AAT domain boundaries assigned by RPSBLAST. The list of the MocR regulators possessing linkers longer than 60 residues is reported in Table 1. Residue frequencies and propensities were calculated as described in [1] and are displayed in Tables 2-5, organized according to linker length and phylum class. Propensities for the entire linker set are reported in [1].
Table 3 Residue propensities in the linkers of length range 21-40. a Amino acid one-letter code.
Dipeptide frequency and propensity calculations relied on the software 'compseq' of the EMBOSS suite [8]. Table 6 reports the average number of residue dyads in each group. The higher the number, the higher the reliability of the dyad propensities reported in Figs. 1-5. The average content of predicted secondary structures (obtained with the program PREDATOR [9]) is displayed in Table 7. Physicochemical properties were assigned to the amino acid residues according to the indices provided by the AAindex data bank [10] incorporated in the Interpol package [11] of the R-project library [12]. Distributions of the properties are reported as box plots in Figs. 6-10, limited to the phyla Alphaproteobacteria, Betaproteobacteria and Gammaproteobacteria, and in Figs. 11 and 12 for all the phyla considered. Box plots for Actinobacteria and Firmicutes, missing from Figs. 6-10, can be found in [1].
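The residue and dyad statistics described above can be illustrated with a minimal Python sketch. It is not the authors' Perl/R code and does not call 'compseq'. The exact propensity definition is given in [1]; here it is assumed, purely for illustration, to be the ratio between the residue frequency in the linker set and the frequency in a background set. The function names and the toy sequences are hypothetical.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, one-letter code


def residue_frequencies(sequences):
    """Relative frequency of each standard residue over a set of sequences."""
    counts = Counter()
    for seq in sequences:
        counts.update(ch for ch in seq.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no standard residues found")
    return {aa: counts[aa] / total for aa in AMINO_ACIDS}


def propensities(linkers, background):
    """ASSUMED definition: frequency in linkers / frequency in background.

    Returns None for residues that never occur in the background set.
    """
    f_link = residue_frequencies(linkers)
    f_back = residue_frequencies(background)
    return {
        aa: (f_link[aa] / f_back[aa] if f_back[aa] > 0 else None)
        for aa in AMINO_ACIDS
    }


def dyad_counts(sequences):
    """Count overlapping pairs of adjacent standard residues (word size 2)."""
    counts = Counter()
    for seq in sequences:
        s = seq.upper()
        for i in range(len(s) - 1):
            pair = s[i:i + 2]
            if pair[0] in AMINO_ACIDS and pair[1] in AMINO_ACIDS:
                counts[pair] += 1
    return counts


if __name__ == "__main__":
    # Toy sequences, invented for the example only.
    toy_linkers = ["PPGSPA", "SPEPAG"]
    toy_background = ["MKLVAPGSTE", "AAGLPSKEDR"]
    print(propensities(toy_linkers, toy_background)["P"])
    print(dyad_counts(toy_linkers).most_common(3))
```

A propensity above 1 would indicate, under this assumed definition, a residue enriched in linkers relative to the background. Dyad propensities can be derived in the same way from the pair counts, which is why small pair counts (Table 6) make them less reliable.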
Table 4 Residue propensities in the linkers of length range 41-60. a Amino acid one-letter code.
The linker length distributions were analyzed within two specific MocR subfamilies, GabR [13] and PdxR [14], involved in the regulation of the synthesis of γ-aminobutyric acid and pyridoxal 5′-phosphate, respectively. Sequences assigned to each of the two subgroups were retrieved from the RegPrecise data bank [15] and aligned separately (Table 8); an HMM profile [16] was calculated for each multiple alignment. Each profile was used to search for other putative GabR or PdxR sequences in the reference proteomes data bank available at the Hmmer web server [17]. Sequences showing an E-value smaller than 10⁻¹²⁰ were retrieved and multiply aligned. Linker sequences were extracted as described above. Length distributions were plotted and compared for the GabR and PdxR sets (Fig. 13).
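The E-value filtering and length-distribution steps can be sketched as follows. This is an illustrative Python example, not the authors' scripts. The hit list format, the function names and the toy values are hypothetical. The length intervals (0-20, 21-40, 41-60, >60) are borrowed from the box-plot figures and are only assumed to apply to the length-distribution plot as well.

```python
def filter_hits(hits, max_evalue=1e-120):
    """Keep the identifiers of hits whose E-value is smaller than max_evalue.

    `hits` is an iterable of (sequence_id, evalue) pairs (hypothetical format).
    """
    return [seq_id for seq_id, evalue in hits if evalue < max_evalue]


# Assumed length intervals: (label, lower bound, upper bound or None if open).
BINS = [("0-20", 0, 20), ("21-40", 21, 40), ("41-60", 41, 60), (">60", 61, None)]


def length_distribution(lengths):
    """Percentage of linkers falling in each length interval."""
    counts = {label: 0 for label, _, _ in BINS}
    for n in lengths:
        for label, low, high in BINS:
            if n >= low and (high is None or n <= high):
                counts[label] += 1
                break
    total = len(lengths)
    if total == 0:
        return {label: 0.0 for label in counts}
    return {label: 100.0 * c / total for label, c in counts.items()}


if __name__ == "__main__":
    # Toy hits and linker lengths, invented for the example only.
    toy_hits = [("seqA", 1e-150), ("seqB", 1e-100), ("seqC", 0.0)]
    print(filter_hits(toy_hits))
    print(length_distribution([10, 25, 30, 70]))
```

Computing percentages rather than raw counts makes the two sets comparable even though they differ in size.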
Perl and R scripts were written for data analysis, processing and display.
Number of residues in the sample.

Table 6
Average number of residue pairs in each data set.

Fig. 6. [...] Table 2 in [1] and code VINM940101 in AAindex [10]). The horizontal axis indicates the average flexibility distribution in the wHTH and AAT domains, in all linkers, and in linkers belonging to different length intervals: 0-20, 21-40, 41-60 and >60 residues. The Y-axis reports the flexibility scale (label AI stands for Average Index). A, B, and C denote Alphaproteobacteria, Betaproteobacteria, and Gammaproteobacteria, respectively.
Fig. 7. [...] Table 2 in [1] and code CIDH920105 in AAindex [10]). For interpretation of plots, refer to Fig. 6 caption.
Fig. 8. [...] Table 2 in [1] and code GEOR03010 in AAindex [10]). For interpretation of plots, refer to Fig. 6 caption.
Fig. 9. Box plots of the distribution of the average normalized β-turn propensity (index #37 of Table 2 in [1] and code CHOP780101 in AAindex [10]). For interpretation of plots, refer to Fig. 6 caption.
Fig. 10. [...] Table 2 in [1] and code CHAM830101 in AAindex [10]). For interpretation of plots, refer to Fig. 6 caption.
Fig. 11. Box plots of the distribution of the average normalized α-helix propensity (index #38 of Table 2 in [1] and code CHOP780102 in AAindex [10]). A, B, C, D and E denote Actinobacteria, Alphaproteobacteria, Betaproteobacteria, Firmicutes and Gammaproteobacteria, respectively.
Fig. 12. [...] Table 2 in [1] and code CHOP780103 in AAindex [10]). Letter interpretation is as in Fig. 11 caption.
Fig. 13. [...] Percentage (%) on the vertical axis indicates the fraction of linkers in the length interval. Sequences were retrieved from the reference proteomes data bank available at the Hmmer web server [17] using a significance E-value threshold equal to 10⁻¹²⁰. With this threshold, 885 and 334 sequences were retrieved for GabR and PdxR, respectively.