Data on the application of the molecular vector machine model: A database of protein pentafragments and computer software for predicting and designing secondary protein structures

Based on ideas about the molecular vector machine of proteins [1], a database of protein pentafragments has been created and algorithms have been proposed for predicting the secondary structure of proteins according to their primary structure and for designing the primary protein structure for a given secondary structure that it takes on. A comprehensive software suite (Predicto @ Designer) has been developed using the pentafragments database and the said algorithms. For the proteins used to create the pentafragments database, a high accuracy (close to 100%) in predicting the secondary protein structure as well as good prospects for its use for designing secondary structures of proteins have been demonstrated.


Data
In this paper, software based on the model [1] is described. The process of predicting secondary protein structure is described in the patent [2]. An example of a prediction result is given in Table 1, A (a fragment of porcine myoglobin [3]). It illustrates that the whole fragment under consideration can be predicted as a sequence of 10-digit numbers. Comparison with experimental structural data [4], visualized with the "Protein 3D" software [5], showed that the software predicts this structure correctly (Fig. 1).
Correction of prediction. Since our approach uses a digital description of pentafragment conformations, replacement of a single amino acid affects prediction accuracy, which is a disadvantage of this method. If some pentafragment is missing from the database for any reason, a gap in the structure is predicted, which is clearly seen in Table 1, A in the example of an alligator myoglobin fragment [5]. However, this disadvantage can be rectified by the correction methods that we have developed [6]. The most interesting of them is the amino acid replacement method (see below).
The results given by this method are shown using the example of alligator myoglobin, whose primary structure was determined in Ref. [7]. Whereas the results in the middle column of Table 1, to which correction was not applied, show quite low prediction accuracy, the region with amino acids from 115 to 138 (Table 1) became completely predictable as a result of applying this method. The predicted structure of alligator myoglobin can be compared with that of porcine myoglobin (Table 1, left column).

Creating the database of protein pentafragments
Text files describing hydrogen bonds in the secondary structure of proteins were obtained from about 2333 PDB files of the Protein Data Bank (2446 subunits). The list of proteins is given in the appendix. With the help of the Protein 3D program developed by us [5] (the program is free to download), these files were processed step by step using mini-programs in order to obtain and sort pentafragments. The steps are listed below.

Specifications Table
Subject: Biology
Specific subject area: Database of protein pentafragments and computer software
Value of the Data: A database of protein pentafragments, sorted according to a binary description of their structure. A computer program, Predicto @ Designer, using this database and algorithm has been written. This program may be useful in problems of predicting and designing protein structure. The obtained data can contribute to the development of a database and computer software.

Table footnote: Bold indicates substitutions of amino acids in the polypeptide chain at which the prediction in column B occurs. The substituted amino acids used are shown in this column to the right.

Obtaining text files
Open the source PDB file in the Protein 3D program. The Rendering icon in the CIHBS settings submenu shows the type of protein with a specification of its hydrogen bond systems. Next, under the CIHBS icon, check the box next to the Trace in memory item. Open the bond types table from the Select bond types item using the dropdown arrow, check the boxes next to the NiH…Oi−3 and NiH…Oi−4 bonds, and uncheck the Show all item. Then click the Show selected bonds item and click OK. This opens a window with information about the H-bonds of the protein.
After clicking the Save links button, we obtain a text file describing these bonds. Table 2, A shows a sample fragment from the 1MWC text file (Sus scrofa myoglobin).

Inverting text files
For the Predicto @ Designer program to work, the amino acid sequences contained in our pentafragments database need to be written from bottom to top. This pattern simulates protein synthesis, which proceeds from the N-terminus to the C-terminus. The Invertor program takes the data written in the text file and rearranges them from the bottom up (Table 2, B).
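The inversion step can be illustrated with a minimal sketch. It assumes that the text file holds one residue record per line; the function name `invert_lines` and the sample records are hypothetical, and the actual file layout handled by the Invertor program is the one shown in Table 2.

```python
def invert_lines(text: str) -> str:
    """Return the text with its lines in reverse order (last line first)."""
    lines = text.splitlines()
    return "\n".join(reversed(lines))


# Hypothetical records, one residue per line (not the real file format).
sample = "GLY 1\nLEU 2\nSER 3"
inverted = invert_lines(sample)
print(inverted)
```

Applying the function twice restores the original order, which gives a simple consistency check for an inverted file.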

Cutting text files into pentafragments
Using the cutter_u program, cut the inverted files into pentafragments that will store information about the arrangement of H-bonds. Cutting is done by shifting the frame by one amino acid. Table 2, C shows some examples of such pentafragments.
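The frame shift by one amino acid corresponds to a sliding window of length five. The sketch below is an illustration under simple assumptions, not the cutter_u program itself: the function name `cut_into_pentafragments` is hypothetical, and each record is reduced to a one-letter residue code, whereas the real files also carry the H-bond information.

```python
from typing import List


def cut_into_pentafragments(records: List[str], size: int = 5) -> List[List[str]]:
    """Slide a window of `size` records along the chain, one record at a time."""
    if len(records) < size:
        return []  # chain too short to hold a single pentafragment
    return [records[i:i + size] for i in range(len(records) - size + 1)]


# Hypothetical seven-residue chain; it yields three overlapping pentafragments.
chain = ["V", "L", "S", "D", "G", "E", "W"]
fragments = cut_into_pentafragments(chain)
for frag in fragments:
    print("".join(frag))
```

A chain of n residues gives n − 4 pentafragments, and consecutive pentafragments share four residues.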

Sorting and simplifying pentafragments
Use the Selector program to sort the pentafragments obtained as shown above in accordance with the H-bond encoding system we have adopted (see Tables 3 and 4). Use the Simplification program to simplify the files obtained (Table 2, D).
An identification system based on the binary coding of H-bonds [8–11] was developed to sort pentafragments into database folders. An example of describing the structure of pentafragments with this coding is given in Table 3. In this case, the 10-digit numbers describing the conformation of pentafragments were transferred to the file names (Table 3, E).
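To make the idea of a 10-digit description concrete, the sketch below concatenates one pair of binary digits per residue of a pentafragment. The assignment of bit pairs to bond types in `BOND_CODES` is purely hypothetical, as are the labels and the function name; the assignment actually used is the one defined in Table 3.

```python
from typing import Dict, List

# Hypothetical bit pairs for the H-bond state of one residue.
# The real assignment is defined in Table 3 and may differ.
BOND_CODES: Dict[str, str] = {
    "none": "00",
    "i-3": "01",
    "i-4": "10",
    "both": "11",
}


def encode_pentafragment(bond_types: List[str]) -> str:
    """Concatenate one bit pair per residue into a 10-digit binary string."""
    if len(bond_types) != 5:
        raise ValueError("a pentafragment has exactly five residues")
    return "".join(BOND_CODES[b] for b in bond_types)


code = encode_pentafragment(["none", "none", "none", "i-4", "i-4"])
print(code)  # 0000001010
```

Because every residue contributes exactly two digits, the code always has ten digits and can serve directly as a file name or lookup key.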
Subsequently, this coding procedure became more complicated (Table 4): additional digits were introduced to identify various types of secondary structures, while the binary principle was retained [11].
The structure of the database organized in accordance with the H-bond encoding system of Table 4 is shown in Table 5. It consists of folders that contain pentafragment files and are designated by the i-th pair of variables (see the Folder numbering column, Table 5), of files enclosed in these folders and named with the 10-digit numbers that describe the structure of the pentafragments (column 2), and of pentafragments contained in these files and associated with their specific positions in proteins (column 3). To speed up the search for pentafragments, the software stores the database in the form of strings (see Ref. [6] for an example).

Program layout
The computer program is named PREDICTO @ DESIGNER. It is written in C++. It has been registered [12] and is described in detail in Ref. [13]. Files in the .pdb (Protein Data Bank) and .gen (GenBank) formats can be used as input; the program transforms them into the .dbk format (Table 6, A), in which it predicts the secondary structure of the protein. The result is written in the .dbkx format (Table 6, B). Fig. 2, a shows the startup screen of the PREDICTO @ DESIGNER program. Clicking on the word PREDICTO sets the program to the secondary protein structure prediction mode (Fig. 2, b shows the workspace where digital and structural information is displayed), and clicking on the word DESIGNER sets it to the design mode (Fig. 2, c shows the workspace, control panel, and icons used to display information required for the design).

The procedure for prediction
The method of predicting secondary protein structure described in the patent [2] consists in isolating pentafragments in a file containing the specially formatted primary structure of a protein (.dbk files) and searching for them in the database. Since every pentafragment has a 10-digit identification number in the database, the software reads the code number of each pentafragment found and displays it in the numeric operating field in a bottom-up sequence as pentafragments are selected along the protein chain from start to finish. The procedure consists of two stages: an initial pentafragment is found at the first stage and, if it is detected correctly, the rest of the protein is predicted at the second stage [2]. With this approach, the secondary structure of all proteins used to build the database is predicted with an accuracy close to 100%.
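The lookup at the core of this procedure can be sketched as follows. This is a simplified illustration, not the patented algorithm: it assumes the database can be reduced to a dictionary that maps each pentafragment sequence to a single 10-digit code, it does not separate the two stages, and the name `predict_codes` and the toy entries are hypothetical.

```python
from typing import Dict, List, Optional


def predict_codes(sequence: str, database: Dict[str, str]) -> List[Optional[str]]:
    """Look up every overlapping pentafragment; None marks a gap in the prediction."""
    return [database.get(sequence[i:i + 5]) for i in range(len(sequence) - 4)]


# Toy database with invented codes; "SDGEW" is deliberately missing.
toy_db = {
    "VLSDG": "0000000000",
    "LSDGE": "0000000010",
    "DGEWQ": "0000101010",
}
result = predict_codes("VLSDGEWQ", toy_db)
for start, code in enumerate(result):
    print(start, code if code is not None else "gap")
```

The missing entry produces a gap at the third position, which is the situation the correction methods are meant to handle.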

Prediction correction method by replacement of amino acids
The method consists in the following [6]. Let us assume that at some i-th stage the software has isolated a pentafragment that has not been found under the code number defined by the search algorithm. If the pentafragment at the previous (i−1)-th stage was found, then the failure is due to the amino acid that entered the pentafragment at the i-th stage. It is well known that such changes (mutations) are frequently observed in proteins of the same type extracted from different species. Because the search for a pentafragment with a missing i-th amino acid should be conducted under the same folder number as for other pentafragments with a similar structure but different amino acids in the i-th position, the original pentafragment search can be replaced with a search for a pentafragment of similar structure in which the amino acid in the i-th position is changed.

Table 3. Notations of bonds in text PDB files (A), types of H-bonds (B), their coding with Boolean pairs of variables (C), an example of a pentafragment (D), and its 10-digit description (E). In cell D, the selected first two lines correspond to the highlighted designation 01 in cell E.
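A minimal sketch of the replacement idea is given below. It assumes that the newly added amino acid is the last residue of the pentafragment and that the database is a plain dictionary from pentafragment sequence to 10-digit code; the restriction of the search to a particular folder number is not modeled. The name `lookup_with_substitution` and the toy entries are hypothetical.

```python
from typing import Dict, Optional, Tuple

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard one-letter codes


def lookup_with_substitution(
    fragment: str, database: Dict[str, str]
) -> Optional[Tuple[str, str]]:
    """Return (matched fragment, code); if the exact fragment is absent,
    try the same first four residues with a different last residue."""
    if fragment in database:
        return fragment, database[fragment]
    prefix = fragment[:4]
    for aa in AMINO_ACIDS:
        candidate = prefix + aa
        if candidate in database:
            return candidate, database[candidate]  # first match in alphabetical order
    return None


toy_db = {"LSDGE": "0000000010"}  # invented entry
print(lookup_with_substitution("LSDGD", toy_db))  # ('LSDGE', '0000000010')
```

Taking the first alphabetical match is an arbitrary choice made for brevity; when several substitutions are available, some additional criterion would be needed to choose among them.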

Further ways to develop the prediction method
The described prediction correction method is convenient for groups of proteins with similar structure derived from different species (as in the case of myoglobins and other heme-containing proteins). Ideally, there would be a universal database that could be used to predict the secondary structure of any protein with high accuracy. We have shown that creating one is practically possible [14]. However, a large increase in the number of pentafragments in the database significantly increases the number of alternative options for the predicted secondary structure. This, in turn, sharply slows down the software and degrades prediction quality.
For these reasons, we believe it is more practical to develop specialized databases aimed at predicting structurally close proteins. A universal database can then be built as a hierarchical structure of specialized databases. The prediction algorithm will consist of two stages: a) a preliminary search for common elements attributable to certain protein groups; b) the final prediction based on a specialized database. Much work remains to be done in this respect, but the results seem quite promising.

Developing a design method for secondary structures
Because the proposed approach can predict secondary structures of proteins quite accurately, it is logical to apply the same approach to designing a primary structure for a predefined secondary structure. This method is detailed in the patent [15]. It is implemented in the Designer section [13] of the Predicto @ Designer software. The initial protein pentafragment and its description as a 10-digit binary number are set using the control panel. The selected pentafragment is searched for in the database and, if it is found, one new amino acid and the 10-digit description of a new pentafragment, containing the previous four amino acids and the new one, are set and a new search is run in the database. If the new pentafragment is found, the procedure is repeated.
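The stepwise extension can be sketched as a loop. In the program the user sets each new amino acid through the control panel; the sketch below instead tries the 20 standard amino acids automatically and keeps the first one that gives the requested code, which is an assumption made only to keep the example short. The dictionary database, the name `design_sequence`, and the toy entries are likewise hypothetical.

```python
from typing import Dict, List, Optional

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard one-letter codes


def design_sequence(
    start: str, start_code: str, next_codes: List[str], database: Dict[str, str]
) -> Optional[str]:
    """Extend `start` one residue at a time so that each new pentafragment
    is present in the database with the requested 10-digit code."""
    if database.get(start) != start_code:
        return None  # the initial pentafragment was not found
    chain = start
    for code in next_codes:
        tail = chain[-4:]  # the previous four amino acids
        for aa in AMINO_ACIDS:
            if database.get(tail + aa) == code:
                chain += aa
                break
        else:
            return None  # no amino acid yields the requested structure
    return chain


toy_db = {  # invented entries
    "VLSDG": "0000000000",
    "LSDGE": "0000000010",
    "SDGEW": "0000001010",
}
designed = design_sequence("VLSDG", "0000000000", ["0000000010", "0000001010"], toy_db)
print(designed)  # VLSDGEW
```

The function returns `None` as soon as a step cannot be completed, mirroring the condition that the procedure is repeated only while the new pentafragment is found.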
The description presented in the patent is based on data available in the literature, and therefore it confirms the feasibility of this design approach. However, before this method is recommended for large-scale implementation, it must pass a more comprehensive experimental validation based on up-to-date scientific and engineering know-how. Studies in this direction are under way.