Optimization and Corroboration of the Regulatory Pathway of p42.3 Protein in the Pathogenesis of Gastric Carcinoma

Aims. To optimize and verify the regulatory pathway of p42.3 in the pathogenesis of gastric carcinoma (GC) using an intelligent algorithm. Methods. Bioinformatics methods were used to analyze the structural domains of the p42.3 protein. Proteins with the same domains as p42.3 and similar functions were screened out as reference proteins. A possible regulatory pathway of p42.3 was established by integrating the acting pathways of these proteins. The similarity between each reference protein and p42.3 was then calculated by a multiparameter weighted summation method, and the result was taken as the prior probability of the initial node in a Bayesian network. The probability of occurrence of each pathway was calculated by the conditional probability formula, and the pathway with the maximum probability was regarded as the most probable pathway of p42.3. Finally, molecular biological experiments were conducted to test it. Results. In the Bayesian network of p42.3, the acting pathway “S100A11→RAGE→P38→MAPK→Microtubule-associated protein→Spindle protein→Centromere protein→Cell proliferation” had the highest probability, and it was also validated by biological experiments. Conclusions. The possibly important role of p42.3 in the occurrence of gastric carcinoma was verified by theoretical analysis and preliminary tests, which helps in studying the relationship between p42.3 and gastric carcinoma.


Introduction
The occurrence and development of gastric carcinoma is a multifactor, multistage, and multistep process [1]. A large number of molecules are involved in it and constitute a complex regulatory network [2]. Finding and identifying key biomarkers for high-risk warning, early diagnosis, and effective treatment is a focus of gastric cancer research [3]. So far, studies have confirmed that multiple antioncogenes, such as PTEN [4], p16 [5], p21 [3], Smad4 [2], Fas [6], and RECK [3], and oncogenes, such as Ras [7], c-myc [1], and MMPs [8], are associated with the development of gastric carcinoma. p42.3 is a novel gene, cloned by applying synchronization, mRNA differential display, and bioinformatics. Researchers have shown that p42.3 may play a vital role in the occurrence and development of gastric carcinoma [9]. Some studies indicate that p42.3 has the characteristics of oncogenes and tumor markers and that it may be one of the early molecular events in the development from gastric mucosa lesion to gastric carcinoma [10]. However, these results did not systematically explain the specific function of p42.3 in this process.
According to our earlier study, p42.3 may be involved in a regulatory pathway in the occurrence and development of gastric carcinoma, as follows: Ras → Raf-1 → MEK → MAPK kinase → MAPK → microtubule-associated protein → spindle protein → centromere protein → cell proliferation [11]. Nevertheless, this pathway has not been verified by molecular biological experiments. Building on that study, this study improved the algorithm for the similarity between the reference proteins and p42.3, investigated the biological features of p42.3 by means of the regulatory networks of the reference proteins, optimized the regulatory network of p42.3, revised the most probable pathway accordingly, and verified it by preliminary molecular biological experiments.

Materials.
Gastric carcinoma cell lines BGC823, MGC803, SGC7901, AGS, N87, and GES1 were provided by Beijing Cancer Hospital (the original sources were from Shanghai Bioleaf Biotech Co., Ltd.) and were cultivated in DMEM culture media with 5% fetal bovine serum in a 5% CO2 cell culture incubator at 37 °C.

Structural Features of p42.3.
After the amino acid sequence of p42.3 (GenBank: NP_848543) was obtained from the NCBI database, the spatial structure of the protein was predicted with the threading prediction tool Phyre 2 (http://www.imperial.ac.uk/phyre/, Imperial College London) [12]. Then, with functional relevance to cell proliferation set as the restrictive condition, proteins containing the two structural domains of p42.3 (the EF-hand and the CC-domain) were searched for and assembled into the data set of reference proteins, through which the possible biological properties of p42.3 were studied.

Similarity Calculation of the Reference Protein and p42.3 Protein.
A multiparameter weighted sum method was used to calculate the similarity between each reference protein and p42.3. First, several parameters on which the two proteins can be compared were selected, and the degree of similarity was calculated for each parameter. Each parameter was then assigned a weight trained by an artificial neural network. Finally, the overall degree of similarity was obtained by weighted summation.

Selection of the Parameters.
According to the literature, the following nine parameters of protein similarity were selected: protein spatial structure, the number of atoms in the molecule, the number of amino acids in each protein, the number of amino acid types, the locations of element P and element S in the protein molecule, and the proportions of C, N, and O atoms in the protein molecule [13][14][15][16].

Similarity Calculation of the Spatial Structure of Protein.
Before similarity values were calculated, the coordinates of each atom were read from the protein structure file (pdb file) and expressed as Euclidean spatial coordinates with the geometric center of the protein as the origin. The distance from each atom to the origin was then calculated. According to these distances, the protein was divided into layers, and the structural similarity of two proteins was analyzed layer by layer. For p42.3, most atoms lay 0∼80 nm from the origin, a small portion lay within 80∼100 nm, and very few lay beyond 100 nm. Therefore, based on radius, the p42.3 protein was divided into 10 layers from the center to the outer edge: the first layer 0∼10 nm; the second layer 10∼20 nm; the third layer 20∼30 nm; the fourth layer 30∼40 nm; the fifth layer 40∼50 nm; the sixth layer 50∼60 nm; the seventh layer 60∼70 nm; the eighth layer 70∼80 nm; the ninth layer 80∼100 nm; and the tenth layer beyond 100 nm. The number of atoms in each layer was counted for each of the two proteins being compared and stored in the vectors $\mathrm{data}_1$ and $\mathrm{data}_2$, respectively. The similarity in atom numbers was then computed layer by layer as $\mathrm{sim} = 1 - |\mathrm{data}_1 - \mathrm{data}_2| / \mathrm{data}_1$, where sim is a ten-dimensional vector storing the similarity of each layer.
Weights were then applied to the similarity of each layer, and the overall density similarity was calculated by weighted summation. It is reasonable to suppose that the layers containing the most atoms are more likely to determine the properties of the protein. On this assumption, the more atoms a layer contains, the higher its weight, so the weight of each layer was determined by the proportion of atoms in that layer. Because this proportion may differ between the two proteins, the average of the two was taken. Hence, each layer was weighted by the formula $w_i = \left( n_{1i}/N_1 + n_{2i}/N_2 \right)/2$, $i = 1, 2, \ldots, 10$, where $N_1$ is the total number of atoms of the first protein, $N_2$ is the total number of atoms of the second protein, and $n_{1i}$ and $n_{2i}$ are the numbers of atoms in the $i$th layer of protein 1 and protein 2, respectively. Thus, the spatial structure similarity of the two proteins was obtained.
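The layered comparison described above can be sketched as follows. This is a minimal illustration in Python rather than the authors' MATLAB code. The function names, the treatment of atoms lying exactly on a layer boundary, and the handling of a layer that is empty in the first protein are assumptions not stated in the text. The per-layer formula is applied as written, so a layer scores below zero when the second protein holds more than twice as many atoms as the first.

```python
import math

# Layer boundaries from the text (units as stated there): eight layers of
# width 10 up to 80, one layer 80-100, and one open-ended layer beyond 100.
LAYER_EDGES = [10, 20, 30, 40, 50, 60, 70, 80, 100]


def layer_counts(coords):
    """Count atoms per layer by distance from the geometric center.

    coords is a list of (x, y, z) tuples. Assumption: an atom lying exactly
    on a boundary is assigned to the outer layer.
    """
    n = len(coords)
    center = [sum(c[k] for c in coords) / n for k in range(3)]
    counts = [0] * (len(LAYER_EDGES) + 1)
    for c in coords:
        d = math.dist(c, center)
        layer = sum(1 for edge in LAYER_EDGES if d >= edge)
        counts[layer] += 1
    return counts


def spatial_similarity(counts1, counts2):
    """Weighted sum of per-layer similarities for two non-empty proteins."""
    total1, total2 = sum(counts1), sum(counts2)
    score = 0.0
    for n1, n2 in zip(counts1, counts2):
        if n1 == 0:
            # Assumption: the text does not define this case.
            sim = 1.0 if n2 == 0 else 0.0
        else:
            sim = 1 - abs(n1 - n2) / n1
        # Layer weight: average share of atoms in this layer.
        weight = (n1 / total1 + n2 / total2) / 2
        score += weight * sim
    return score
```

Because each protein's layer shares sum to one, the averaged weights also sum to one, so two proteins with identical layer counts score exactly 1.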

Similarity of the Total Number of Atoms and the Number and Type of Amino Acids.
The similarity algorithms for these three parameters were alike. The number of atoms, the number of amino acids, and the number of amino acid types in each protein were read from the pdb files of the two proteins with the textread function in MATLAB. The total numbers of atoms of the two proteins were recorded as $N_1$ and $N_2$, respectively, and the similarity in atom numbers was calculated as $\mathrm{sim} = 1 - |N_1 - N_2| / N_1$. The similarity of the number of amino acids and of the number of amino acid types was obtained in the same way.
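As an illustration of this step, the sketch below counts atoms, residues, and residue types from the ATOM records of a PDB file and applies the same relative-difference formula. It is written in Python, not the authors' MATLAB, and the function names are hypothetical. It relies on the fixed-column PDB layout (residue name in columns 18-20, chain identifier in column 22, residue number and insertion code in columns 23-27) and ignores HETATM records, which is an assumption about what the authors counted.

```python
def pdb_counts(pdb_lines):
    """Return (atom count, residue count, residue-type count) from ATOM records."""
    atoms = 0
    residues = set()
    residue_types = set()
    for line in pdb_lines:
        if not line.startswith("ATOM"):
            continue
        atoms += 1
        res_name = line[17:20].strip()
        # A residue is identified by chain ID plus residue number and insertion code.
        residues.add((line[21], line[22:27]))
        residue_types.add(res_name)
    return atoms, len(residues), len(residue_types)


def count_similarity(n1, n2):
    """sim = 1 - |n1 - n2| / n1, with protein 1 as the reference."""
    return 1 - abs(n1 - n2) / n1
```

Note that the formula is not symmetric: swapping the two proteins changes the denominator, so the choice of which protein is "protein 1" matters.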

Similarity of Each Element.
This study mainly analyzed the elements C, N, O, P, and S. First, the proportion of C, N, and O atoms in the total number of atoms was calculated for each protein. The similarity for each of C, N, and O was then calculated with the formula $\mathrm{sim}_{\mathrm{element}} = 1 - |p_1 - p_2| / p_1$, where $p_1$ and $p_2$ are the proportions of that element in the two proteins. In protein molecules, P and S atoms are usually few, but both play crucial roles in protein function. The p42.3 protein contains only one S atom and no P atoms, so calculating similarity from the number of atoms of these two elements would not be meaningful. Instead, the similarity of the locations of the P and S atoms was used as the criterion. In this algorithm, if the atoms of an element in the two proteins were in the same layer, the similarity was taken as 1.0; if they were in adjacent layers, the similarity was 0.8; otherwise, the similarity was 0. In this way, the similarity parameter of each element was obtained.
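The two element-level rules can be sketched as below. This is an illustrative Python version with hypothetical function names. It reads the location rule as comparing the layer index of an element's atom in one protein with the layer index of the same element's atom in the other. The text does not say how a protein lacking the element altogether is scored, so that case is not handled here.

```python
def proportion_similarity(count1, total1, count2, total2):
    """Similarity of an element's share of all atoms in two proteins."""
    p1 = count1 / total1
    p2 = count2 / total2
    return 1 - abs(p1 - p2) / p1


def position_similarity(layer1, layer2):
    """Score the P (or S) location by layer index: same, adjacent, or other."""
    gap = abs(layer1 - layer2)
    if gap == 0:
        return 1.0
    if gap == 1:
        return 0.8
    return 0.0
```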

Calculation of the Weight of Each Parameter.
Based on the similarity of each parameter calculated above, the overall similarity was worked out by the weighted summation method. Beforehand, data on 100 pairs of similar proteins had been collected. Using the methods described above, the similarity of each parameter was calculated for each pair of proteins: $S_1$ to $S_9$. BLASTp was then used to search the homology of each pair of proteins, which was regarded as the overall similarity $S$. Therefore, for each pair of proteins, a 1 × 10 similarity data vector was obtained: $[S_1, S_2, S_3, S_4, S_5, S_6, S_7, S_8, S_9, S]$.
The similarity data of the 100 protein pairs were then input to a BP (back propagation) artificial neural network for training, which yielded the weight of each parameter, $Q_i$ (Table 1). For each pair of proteins, the overall similarity can therefore be calculated by the formula $S = 0.3183 S_1 + 0.0343 S_2 + 0.0204 S_3 + 0.0603 S_4 + 0.0653 S_5 + 0.1062 S_6 + 0.1002 S_7 + 0.1477 S_8 + 0.1480 S_9$. In this formula, $S$ is the overall similarity of the two proteins and $S_i$ represents the similarity of each parameter, where $i = 1, 2, \ldots, 9$ corresponds to spatial structure (density), number of atoms in the protein, number and type of amino acids, number and proportion of C, N, and O atoms [8], and spatial position of P and S atoms in the protein, respectively. On the basis of this formula, the similarity between each reference protein and p42.3 was calculated, and the data set of reference proteins was composed.
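For concreteness, the weighted summation can be written as a short Python function using the nine weights quoted in the formula. The function name and the ordering check are illustrative assumptions; the weights themselves are taken from the text and are not retrained here.

```python
# Weights for the nine parameter similarities, in the order given in the text.
WEIGHTS = [0.3183, 0.0343, 0.0204, 0.0603, 0.0653,
           0.1062, 0.1002, 0.1477, 0.1480]


def overall_similarity(param_sims):
    """Weighted sum of the nine per-parameter similarities."""
    if len(param_sims) != len(WEIGHTS):
        raise ValueError("expected one similarity value per weight")
    return sum(w * s for w, s in zip(WEIGHTS, param_sims))
```

The quoted weights add up to 1.0007 rather than exactly 1, presumably because of rounding, so two proteins that agree perfectly on every parameter score marginally above 1.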

Construction and Optimization of a Bayesian Regulatory Network.
Under the condition of cell proliferation, the reference proteins obtained by the similarity calculation were screened. Then, with each reference protein as the starting point and cell proliferation as the ending point, the acting pathway and nodes of each reference protein were collected. Different pathways intersect, thus constituting a regulatory network [11]. In the network, "+" indicates positive (promoting) regulation and "−" indicates negative (inhibiting) regulation. The similarity between each reference protein and p42.3 was set as the initial prior probability, and the probability of occurrence at each node was worked out by conditional probability within the Bayesian network framework. Bayesian networks are directed acyclic graphs (DAGs) that describe the joint probability distribution of a finite set of variables $X = \{X_1, X_2, \ldots, X_n\}$. A Bayesian network can be written as the pair $B = (G, \Theta)$, where $G$ is a DAG whose nodes represent the random variables $X_1, X_2, \ldots, X_n$, which can stand for gene expression vectors in expression profiling data, and $\Theta$ represents the conditional probability of each variable. The DAG encodes the following independence relation, the Markov assumption: each variable is independent of its nondescendants given its parent nodes in $G$. Under this independence assumption, the Bayesian network defines a unique joint probability distribution over $X$: $P(X_1, \ldots, X_n) = \prod_{i=1}^{n} P(X_i \mid \mathrm{Pa}(X_i))$, where $\mathrm{Pa}(X_i)$ denotes the parent nodes of $X_i$. To determine this joint probability, all the conditional probabilities in the formula need to be specified.
In the Bayesian network in this paper, after the probability of occurrence at each node and along each pathway was obtained, Bayes' theorem was used to invert the probabilities and infer how likely the protein was to act at each node, thus finding the most probable regulatory pathway of the p42.3 protein.
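One way to read the pathway-selection step is sketched below in Python. It assumes each pathway is a simple chain, so the chain factorization reduces to the prior of the starting reference protein multiplied by the conditional probability of each downstream node given its parent. The function names and the example numbers are hypothetical and are not the values behind Figure 3. The text does not specify how the conditional probabilities were estimated or how intersecting pathways were combined.

```python
from math import prod


def pathway_probability(prior, conditionals):
    """Probability of a chain: prior of the start node times each
    downstream node's conditional probability given its parent."""
    return prior * prod(conditionals)


def most_probable_pathway(pathways):
    """pathways maps a name to (prior, [conditional probabilities]).

    Returns the name and probability of the highest-scoring pathway.
    """
    scores = {name: pathway_probability(prior, conds)
              for name, (prior, conds) in pathways.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]


# Hypothetical illustration only: two made-up pathways.
example = {
    "pathway_A": (0.9, [0.9, 1.0]),
    "pathway_B": (0.5, [1.0]),
}
best_name, best_score = most_probable_pathway(example)
```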

The Molecular Biological Test of the Optimal Path.
After the most probable acting pathway of p42.3 was obtained through calculation and prediction, some basic biological experiments were carried out for initial validation. First, the TRIzol method (Invitrogen, USA) was used to extract total mRNA from the six cell lines: BGC823, MGC803, SGC7901, AGS, N87, and GES1. cDNA was synthesized by reverse transcription (cDNA reverse transcription kit, Thermo Fisher Scientific, United Kingdom). Primers were designed according to the gene sequences of the reference proteins and p42.3; the primer sequences are shown in Table 2. With β-actin as the control, each target was amplified by RT-PCR (PCR amplifier, Eppendorf, Germany). After the PCR products were detected by agarose gel electrophoresis, the expression of the various proteins in the different cell lines was compared.

Structural Features of the EF-Hand and CC-Domain.
The spatial structure of the protein was predicted with the threading prediction tool Phyre. A three-dimensional ligand-binding model of the EF-hand region of p42.3 was predicted with 3DLigandSite (http://www.sbg.bio.ic.ac.uk/~3dligandsite/, Imperial College London) [17]. The metal ion binding sites of p42.3 were ALA78, SER79, TYR81, and ARG86, as shown in Figure 1. Proteins with high structural homology to the EF-hand and CC-domain of the p42.3 molecule were then searched for. Some of those sharing the EF-hand structure are shown in Table 3. Proteins functionally related to cell proliferation were then screened out as the reference proteins.

Similarity Calculation of the Reference Protein and p42.3.
Protein similarity was evaluated from the nine parameters described above. MATLAB (MathWorks, USA) was used for programming. The similarity between each reference protein and p42.3 was calculated. After screening by "cell proliferation," the results are displayed in Table 4.

Bayesian Regulatory Network.
Cell proliferation was set as the restrictive condition, and the acting pathways and nodes of different reference proteins were worked out by literature collection. With different acting pathways crossing, the regulatory network was thus formed, shown in Figure 2.
The round nodes represent the reference proteins and are the initial nodes. The relation between nodes is upstream and downstream regulation. Arrows indicate the direction of action; "+" denotes positive regulation and "−" denotes negative regulation. The similarity between each reference protein and p42.3 was treated as the prior probability of the initial parent node. According to formula (1), the probability of occurrence at each downstream node was calculated until the final result for cell proliferation was reached. This final value is the probability of occurrence of that pathway. The results are shown in Figure 3. The probability of the path drawn with a thick line was 0.9781, higher than that of the other pathways. Taken together with the results of the protein similarity comparison, this provides initial evidence that the pathway "S100A11 → RAGE → P38 → MAPK → Microtubule-associated protein → Spindle protein → Centromere protein → Cell proliferation" is the most probable.

The Molecular Biology Test.
Based on the analysis of the spatial structure of p42.3 and the Bayesian regulatory network, the expression of S100A11 (the protein with the largest positive weighted value) and S100A2 (the protein with the shortest negative acting path) in gastric carcinoma cell lines was examined for preliminary validation of the correlation between p42.3 and S100A11. The results showed that when p42.3 was expressed normally, both S100A11 and S100A2 were also expressed. Figure 4 indicates that the expression of S100A11 was extremely similar to that of p42.3, while the expression of S100A2 was considerably different from that of p42.3. Together with the analysis of the protein structure, this suggests that the regulatory pathway of p42.3 may be consistent with that of S100A11, or that p42.3 may be involved in that regulatory pathway.

Discussion
The occurrence and development of gastric carcinoma involve changes in the structure and expression of multiple related genes [9]. In particular, the activation of oncogenes and the inactivation of tumor suppressors play important roles in it [10]. So far, many studies have tried to uncover the molecular regulatory mechanisms of gastric carcinoma in order to find biomarkers for the diagnosis and treatment of gastric cancer, which are expected to provide an effective adjunct to surgery and chemoradiotherapy.
p42.3 expression is dependent on mitosis; it is expressed at low levels or not at all in normal gastric mucosa but is highly expressed in gastric carcinoma tissues. It promotes cellular proliferation and tumor metastasis [9]. Changes in p42.3 gene expression during the development of gastric carcinoma indicate that p42.3 might be a target for gastric carcinoma diagnosis and treatment [10,11]. An EF-hand structural domain was found in the N-terminal amino acid sequence of the p42.3 protein; this domain is also present in the S100 family of proteins [36]. The EF-hand structure consists of a typical helix-loop-helix structural unit, that is, two alpha helices linked by a Ca2+ chelating loop [37]. In the reports on EF-hand structures, most EF-hand structural domains occur in even numbers and form structural domain pairs, separated by a connecting region, or form homologous or heterologous dimers, as in the S100 family proteins with two EF-hand structural domains [38,39]. Proteins with an odd number of structural domains usually need to form homologous or heterologous dimers, and they are active in the dimer form. The CC-domain is a kind of super-secondary protein structure in which two to seven helices (most commonly two or four) intertwine to form a braided structure [40]. Many proteins with coiled helical structures have significant biological functions, such as transcription factors in the regulation of gene expression [40]. The most well-known proteins containing coiled helical structures are oncoproteins and tropomyosin. To study the action mechanism of p42.3, a similarity algorithm with multiparameter calculation was adopted to find proteins with high structural similarity to p42.3. As a result, proteins that might be related to the occurrence of gastric carcinoma were screened and treated as nodes of the gene regulatory pathway [11].
Figure 4: Expressions of p42.3, S100A11, and S100A2 in the cell lines of gastric carcinoma.
Through a series of probability calculations, it was found that the possible action mechanism of p42.3 in the pathogenesis of gastric carcinoma was S100A11 → RAGE → P38 → MAPK → Microtubule-associated protein → Spindle protein → Centromere protein → Cell proliferation (Figure 3). The initial molecular experiments also confirmed the consistency of p42.3 and S100A11 gene expression in gastric carcinoma cells (Figure 4). The study of gene regulatory networks can be used to quantitatively mine information about the regulation of gene expression. By extracting and analyzing this information, gene function and genetic networks can be better understood and the pathogenesis of disease clarified. The study of gene regulatory networks aids in the exploration of gene function within an overall framework [11]. Gene function should be studied not only at the structural level but also at the network level. Genes affect each other and work together in intricate networks, which consequently give rise to new functions that cannot be fully revealed by the DNA sequence.
The S100 proteins are a group of calcium-binding proteins with low molecular weight (10-12 kDa). Their amino acid sequences are highly conserved in vertebrates [41]. S100 proteins share a high degree of homology with calmodulin and other EF-hand calcium-binding proteins [41]. The biological functions, tumor-specific expression, and chromosomal localization of the S100 protein family point to a close relation between S100 proteins and tumors. Recent studies have indicated that S100A11 (S100C) can serve as a tumor suppressor protein in some tumors and a tumor promoter in others [42]. S100A11 is upregulated in breast cancer, prostate cancer, and non-small cell lung cancer, where it promotes tumor metastasis and invasion [43,44]. In contrast, S100A11 acts as a tumor suppressor in urinary bladder and renal carcinoma [45]. Our experimental results show upregulated expression of S100A11 in gastric cancer. As a candidate tumor suppressor protein, S100A2 is expressed at significantly lower levels in a variety of malignancies, such as breast, liver, prostatic, and esophageal cancer [46][47][48][49]. Studies have indicated that S100A2 can inhibit cell proliferation and invasion and act as a tumor suppressor involved in the occurrence and metastasis of gastric carcinoma [50], which is in agreement with our findings. Through analysis of the expression of S100A11 and S100A2 in gastric cancer, both of which contain the EF-hand structure and whose effects are, respectively, consistent with and opposite to that of p42.3, the participation of p42.3 in the occurrence and development of gastric cancer was supported from both directions.
Currently, there are various ways to compare protein structures, each with its own advantages and disadvantages [51]. Most of them calculate a similarity value for a pair of proteins by applying a mathematical algorithm to the protein structures. That is, they analyze only the spatial conformation of the proteins and do not take the similarity of proteins in other respects into account. For example, element P and element S are crucial to the functions of proteins. Using the multiparameter comprehensive comparison method, this study not only compared the differences between two proteins in spatial atomic density but also considered their similarity in many other respects. In the weighted summation of the parameters, all the weights came from training on diverse data rather than from subjective weighting. This supports the accuracy of the weight of each parameter and avoids assigning a high weight to a parameter that contributes little to the overall similarity. Consequently, the similarity of two proteins was calculated more accurately. The whole calculation was carried out by an M-file written in MATLAB, so batch comparison of any number of proteins can be carried out easily and quickly.

Conclusions
Here, the ligand-binding model of the EF-hand structure of p42.3 was successfully predicted. A Bayesian network was constructed and optimized with the corresponding mathematical algorithm to predict the most likely pathway. In addition, molecular biology experiments indicated that p42.3 and S100A11 may share common characteristics, which provides a hypothesis for further research. In summary, our findings provide important research directions for exploring the mechanism of action of p42.3 in gastric cancer.