Analysis and Research of Key Genes in Gene Expression Network Based on Complex Network

,


Introduction
With the development of gene chip and second-generation sequencing and the emergence of new technologies, the analysis of the human genetic structure has become a reality, the association information between genes is analyzed, being able to express genes related to key genes, and it is possible to analyze the correlation between genes and diseases more effectively. Establishing a biological network can obviously discover and analyze the correlation and influence between expression elements in various biological systems. It makes it possible to study the characteristics of organisms at a certain level [1]. Unlike the research on inherent single molecules, the information at the whole system level can be displayed by establishing networks.
At present, molecular biology network analysis tools in this field mainly include metabolic network [2], protein-protein interaction network [3], gene coexpression network [4], gene regulatory network [5], and signal transmission network [6]. In order to expand the scope and depth of the current research, the transition from a single molecule to a network system level becomes the next major development direction of interactive networks because of, in general, the complex biological phenomena and pathology that cannot be caused by only one factor. erefore, as a tool to explore complex pathological and life phenomena, the interactive network provides an effective analysis method, to perfect the comprehensive research. ese networks have great research value, among them; the gene coexpression network has irreplaceable research characteristics in some aspects. Whether the amount of gene expression can establish the correlation of related diseases, the expression network can establish the complex network, and it is challenging to analyze the gene correlation with the related technologies of the complex network.

Related Works
e reconstruction algorithm of the gene regulatory network and the exploration of a series of related models began in the middle of the 20 th century. Raterproposed the system and characteristics to control the interaction and interaction between all genes in prokaryotic cells exploring the mutual influences and interactions between genes in prokaryotic cells, as well as system relationships and characteristics. In 1969, Kauffman discovered and described the exploration research with epoch-making significance, using popular binary logic rules to stipulate the basic construction model and reconstruction algorithm of gene network [7][8][9]. D'Haeseleer et al. [10] were the first to use ordinary differential equations to do some research on time series data. Gardner et al. [11] were the first to prove that, by using their multivariate regression network identification (NIR) algorithm, the steady state measurements can be used to infer the network structure. ey studied a data set, overexpression of specific genes in bacterial models with plasmids, when the gene expression level reaches a new steady state value, using its measured values. is method has been suggested for transcriptase data sets including siRNA knockout experiments [12]. In the past decades, numerous computational methods have been proposed to infer gene regulatory networks. ese methods can be roughly divided into coexpression-based methods [13], supervised learning method [14], the model-based method [15], and the information theory-based method [16,17] with lower computational complexity. However, direct correlation and model system dynamics cannot be inferred. Based on the supervised learning method, we mainly use the known rules to infer gene regulatory networks on genome-wide data, such as SEREND [14], GENIES [15], and SIRENE [18], but all need additional information of regulatory relationships to cultivate models. By guiding reasoning from prior information of known rules, we can achieve higher accuracy and be superior to many other methods.
Constructing complex gene expression networks and using large-scale gene expression data sets for network analysis are effective methods to reveal new biological knowledge. However, methods for gene association in the construction of these coexpression networks have not been thoroughly evaluated. Because different methods lead to different coexpression networks with different structures and provide different information, it is very important to select the best gene association method. Zhang et al. [19] proposed the identification of protein complexes in protein interaction network (IPC-RPIN), which effectively fused topological structure, gene expression profile, and GO functional annotation information to achieve gene-protein expression complex. Hua et al. [20] proposed the fusion research of three commonly used reasoning algorithms to establish a genome-scale and high-quality gene coexpression network. After applying this expression network to monocotyledonous plant rice, the network quality has been verified and evaluated through the selected gene function association data sets, which is obviously superior to other methods. Li et al. [21] predicted subcellular location information by integrating time-history gene expression data with the spatial and temporal active protein interaction network (ST-APIN). In order to evaluate the efficiency of the proposed method, the commonly used classical clustering algorithm has obvious advantages in identifying protein complexes in ST-APIN and the other three dynamic PIN.

Technical Indicators Related to
Complex Networks 3.1. Gene Degree. Degree is an important index to describe the attributes of nodes. e degree of nodes refers to the measurement that genes are associated with genes. In directed complex networks, the degree of nodes can be divided into exit degree and entry degree, which indicates the number of genes pointing to other genes and the number of other genes pointing to the gene. e average degree of a complex network reflects the density program of the network, and the degree distribution can describe the importance of different genes through the number of genes and gene connections. When the scale of gene expression network is very large, the degree distribution of genes can fully display the distribution law of genes and can identify and distribute different types of networks. e distribution of gene complex network degree is described as follows: (A) Gene distribution function P(x): It indicates the proportion of gene x in the gene expression network in the whole expression network. (B) Cumulative degree distribution functionP x : e P x description degree is not less than the probability distribution of gene x, and the distribution function is as follows: If the gene degree of the gene expression complex network obeys the distribution function P(x)∞k − c of the power index c, it is said that the gene complex network obeys the scale-free network of power-law distribution, while the accumulation degree distribution function P x obeys the power-law distribution P x ∞x 1− c of power c − 1.

Gene Network Density. Gene network density d(G)
is used to describe the density of connection edges between genes in the gene network and is defined as the ratio of the actual number of gene connections in the gene network to the upper limit of the number of gene connections that can be accommodated. Dense programs and dynamic evolution laws used to measure the relationship between gene expression networks in gene complex networks. ere are N gene nodes and L edges in the gene complex network, and the density formula (2) of the gene complex network is defined as follows: e density range of the gene complex network is [0, 1]. When d(G) � 1, the network is all connected. When d(G) � 0, there are no connections in the network. In the actual gene network, the network with density 1 does not exist. In most networks, it is around 0.5. e density of largescale networks is smaller than that of small-scale networks, and the density of networks of different sizes cannot be directly compared. e absolute density formula (3) can be used to compare the network densities of different sizes: In the above, D represents the diameter of the network, R represents the radius, and S represents the circumference of the network.

Aggregation Coefficient of Gene Network.
Gene network clustering coefficient is used to describe the degree of node neighbors associated with gene nodes in a gene network. In the gene network, the node v i and the gene network aggregation coefficient C i represent the probability of interconnection with its adjacent gene nodes. k i is used to represent the number of neighbors interconnected by gene nodes v i , and e i is used to represent the number of interconnected neighbors existing in k i . k i (k i − 1)/2 represents the maximum number of interconnections, and equation (4) represents the aggregation coefficient of gene node v i : e aggregation coefficient of gene nodes is vividly described in the gene complex network, which can be understood as the correlation between one gene node and another gene node, the probability of connection between this gene node and the neighbor of another gene also exists, and the probability comparison of correlation exists, so it has strong aggregation in the gene complex network. e aggregation of the whole gene network is evaluated by the average aggregation coefficient. e average aggregation coefficient of the gene complex network is defined as the average aggregation coefficient of all gene nodes in the gene network. Equation (5) is described as follows: where |V| represents the number of gene nodes, C i represents the aggregation coefficient of gene node v i , and its value range is [0, 1]. When C � 0, there are no connected gene nodes in the gene network; whenC � 1, the gene network node is all connected. e average aggregation coefficient in the gene expression network describes the probability of connection between any associated genes in the gene network and reflects the closeness of the node relationship of the whole gene.

Construction of Complex Network for Gene Expression.
Gene chip expression matrix reflects different gene expression levels of different sequencing samples, which can be described by n × mmatrix, where X nm represents the expression level of the nth gene in m samples: In this paper, the WGCNA algorithm [22] is used to construct the gene expression network.
e WGCNA algorithm is based on a scale-free network. Whether there is a correlation between the two genes can be expressed by the correlation coefficient, forming the following formula: M ij represents the similarity of gene i and gene j in the expression amount of samples in the chip and M represents the similarity matrix in [0, 1]. In the construction of the gene expression matrix, the similarity is calculated through the expression of different genes in all samples. e similarity is an important measurement index to measure the strength between genes and is also the basis for the construction of complex networks. In the WGCNA algorithm, S ij is measured by weighting using a soft threshold, andβ is used to represent the soft threshold, while the general network structure adopts a hard threshold. e gene expression network using a soft threshold is more prominent, and its strong correlation is more prominent. Equation (8) is as follows: e weighting coefficient is defined according to scale network, and the correlation between log(k) of network node number K and log(p(k)) of node occurrence probability in a scale-free network is required to reach more than 0.8. Considering the correlation between genes and genes, the correlation between the individual genes and genes is expressed by the adjacency matrix, and topological overlap measure (TOM) is expressed by the strong relationship between genes and genes: en, set the correlation threshold λ (generally referring to the hard threshold) and compare it with the value of TOM matrix. If the matrix value in TOM is greater than λ, it indicates that there is a correlation between gene i and gene j (which can be expressed by adjacency matrix MJ ij � 1). If the matrix value in TOM is less than λ, it means that there is no correlation between gene i and gene j (which can be expressed by adjacency matrix MJ ij � 0). e generation of gene coexpression network according to the above principles is also the basis of this study and provides the basic conditions for analyzing core genes.

Convergence of PageRank Algorithm.
In the complex network of genes, there is a correlation between each gene Complexity and the genes related to it, but the influence of genes related to genes should also be considered. rough the correlation between genes, the PageRank algorithm is used to realize this correlation between genes, thus determining key genes.
Feature vector centrality and its variants are widely used. For example, the most famous PageRank algorithm [23] in the field of web page sorting is the core algorithm of Google search engine. At the initial time, each gene is given the same PR value, and then iteration is carried out, and the current PR value of each gene is divided equally to all genes it points to in each step. e new PR value of each gene is the sum of the PR values it obtains, so that the PR value of the node Pi at time t is defined as the following equation: where k out j is the output of node P i , a ij gene expresses the initial matrix of the correlation network, and PR iterates until the value of each node reaches stability.
e PageRank algorithm is applied in many fields. rough continuous iteration, a relatively stable value is reached. As the number of iterations increases, the error ε of the values generated before and after is expressed. When a reasonable error value ε is reached, the PageRank algorithm ends, as expressed by the following equation: eoretically, ε is determined according to the user's experimental results. When the error comparison is required to be small, e comparison is set.
e relevant literature of the PageRank algorithm does not give the relevant convergence proof process, only the iterative process is used to explain its convergence, and no relevant mathematical methods are used to explain it. e following author gives the relevant proof process.

Theorem 1. If the sum of the values of all columns of a gene correlation matrix is 1, then the matrix is a convergence matrix.
Proof in which n i�1 PR i (t) � 1; it is sufficient to prove that M t in formula (12) in which the sum of each column is 1 andm ii � 0, which is According to the disk theorem, M � (m ij ) has N disks whose N eigenvalues all fall on the plane: in whichλ i ∈ ∪ n i�1 D i (M),R i � n j�1 and j ≠ i |m ij |. In the matrix M, N disks of M given according to equation (14) are e N eigenvalues of M′ all fall in the union set of n disks on the complex plane. e disk eigenvalues are seen as shown in Figure 1. erefore, when m � t is a convergence matrix, the proof is complete.
λ i is in the white disk of Figure 1; we can look at N eigenvalues |λ i | ≤ 1 of M′, according to equation (13).
erefore, when lim t⟶∞ M t is a convergence matrix, the proof is complete.

Disk Analysis of Improved PageRank Algorithm.
e above PageRank algorithm is based on the traditional method to prove convergence. Because the traditional β l PageRank algorithm does not satisfy the characteristics of strong connectivity and traps, that is, some web pages do not have links and loops pointing to other web pages, there are 4 Complexity links pointing to them or links not pointing to other web pages. e improved PageRank algorithm is M satisfies the normalized matrix of the traditional PageRank algorithm. After β takes 0.8, the form of A matrix is as follows: At this time, the convergence of matrix A is proved with 0.2 * (1/N) as the center of the circle, as shown in Figure 2.
As can be seen from Figure 2, the center of the improved PageRank algorithm is 0.2 × (1/N), and the radius is 0.8 + 0.2 × (N − 1/N) or 1 − 0.2 × (1/N), so that the disk can pass through the point of 1, but the disk does not pass through the point of −1, so the maximum eigenvalue of the improved PageRank transfer matrix takes 1 and the minimum value cannot take −1. erefore, when t ⟶ ∞, λ t i ∈ 0, 1 { } is easier to converge.
According to the properties of the PageRank algorithm, R 1 + R 2 + R 3 + · · · + R n � 1 takes any polynomial n−1 term and combines it with the above: −R 1 + a 12 * R 2 + a 13 * R 3 + · · · + a 1n * R n � 0(1) After (n) terms are added, the above N terms are linearly independent. Equation (27) can be transformed into A ′ matrix is reversible, and the value of R is

Data Selection.
e SVM-REF algorithm was used to select 2105 genes and 265 samples (198 TNBC, 67 no-TNBC). By establishing decision trees, we can see the distribution of decision trees under 2105 genes for each sample, as shown in Figure 3.
As can be seen from Figure 3, GSM1974605 has outliers under 2105 genes, and the GSM1974605 sample can be deleted. After deleting outlier samples, it can be seen that they are roughly divided into two categories, and the distinction is obvious, indicating that 2105 gene selection is reasonably classified under 197 samples.
Selecting soft threshold B plays a very important role in forming TOM gene association matrix. In this paper, the optimal threshold is obtained through continuous iteration of the threshold. Since the scale-free network atlas structure can reach 0.8 or the average connectivity can reach below 100 when β < 15, if the above requirements cannot be met, the experimental results will be greatly affected. Figure 4 is the distribution diagram of the threshold value and average connectivity.
As can be seen from Figure 4, at that time, the network map structure could reach more than 0.9, indicating that the threshold value of this interval is ideal and the optimal value is the most 8. However, the average connectivity is less than 100 at 3, so β � 8 meets the requirements of the above conditions. e TOM matrix is established. e TOM matrix is an index value representing the correlation between genes and is converted into a connection matrix through the TOM matrix. e relevant thresholds of TOM matrix are screened and the thresholds are introduced to form different complex gene networks after continuous changes. As that threshold value � 0. 2186, a complex network of gene is shown in Figure 5; as that threshold value � 0.2664, a complex network of gene is shown in Figure 6.
As can be seen from Figures 5 and 6, when the threshold B increases, the complex network diagram gradually decreases, the network structure becomes smaller, and the network community will appear. Smaller outlier gene 6 Complexity     networks will also appear so as B increases to a certain value, the network structure changes from a large-scale complex network to a small-scale network structure.

Analysis of Complex Network Structure.
e analysis of complex network structure analyzes the structural changes of the gene complex network from the indexes of centrality, aggregation coefficient, central potential, and so forth. By introducing threshold λ to change the structure of the gene complex network, the network structure can be seen to change continuously by increasing the threshold λ, and the network changes from a large-scale structural model to a small-scale structural model. e specific effect is shown in Figures 7-10.
From Figures 7-10, it can be seen that, with the increase of λ, the structure of the gene network has changed greatly. With the increase of the performance of the network structure, the scale of the whole network is continuously decreasing, resulting in the continuous changes of various complex network parameters, the overall scale is small, and the network community is increasing, resulting in the continuous changes of key gene nodes. en, PageRank is used to calculate network structures of different sizes, and PageRank values of genes are sorted. Under different threshold settings, the PageRank values of each gene node are shown in Tables 1-4.  Tables 1-4 show the PR values of genes in the gene complex network under different λ values. With the continuous increase of λ, the number of genes in the gene complex network is continuously decreasing, and the expressed gene network structure is also in different states. e values of gene PR values in different networks are also constantly changing, and the ranking of gene PR values is also constantly changing. In each threshold, it is the calculation of different screening networks, which can be determined according to the size of the network or the number of genes. When the number of selected genes is large, you can choose λ � 0.2345 or λ � 0.2505. If λ is relatively small, the number of selected genes will be relatively large, and there will be more key genes. When λ is larger, the relative number is small, but the key genes are few. erefore, when choosing, you can choose several suitable values of λ for comparison. For example, a certain gene appears in several tables consecutively.
In Table 1 to Table 4, several TOP gene tables are ranked by PR value, and the top gene is the key gene. e PR values are given in the table, and the sum of the PR values of all genes is 1. Figure 11 shows the changes of several gene values (CDK13, DSPP, HLA.G, LINC00304, TPX2, FOXM1) and screens several genes with a higher ranking.
As can be seen from Figure 11, as the λ value and PR value change continuously, the whole PR value increases continuously. As the network scale becomes smaller, the PR value of gene nodes increases continuously. Some genes become outliers because the network scale is constantly changing, and gene nodes will not have new PR values in the later screening process, resulting in PR values of 0. 8 Complexity erefore, different gene complex network structure models will produce different gene structures, and selecting an appropriate gene structure network will obtain different key genes.
e key genes can be analyzed from the above, and the expression significance of the key genes screened will be explained by biological significance. e expression significance of some key genes is shown in Table 5.        Cyclin-dependent kinase (CDK) 12 and 13 are highly selective dual inhibitors that can inhibit triple-negative breast cancer (TNBC) cells. In terms of mechanism, the inhibition or loss of CDK12/CDK13 will cause the polyadenylation sites of introns to break, thereby inhibiting the expression of critical DNA damage response proteins. is causes the "BRCAness" phenotype, which leads to defects in DNA damage repair and promotes synergy with DNA-damaging chemotherapy and PARP inhibitors.

TPX2
TPX2 protein is highly expressed in breast cancer and has a certain correlation with the histological grade of breast cancer and lymph node metastasis. It may be a risk factor for the occurrence, development, and lymph node metastasis of breast cancer.

FOXM1
FOXM1 is considered as an emerging molecular target with its carcinogenic effect and 85% high overexpression characteristics in TNBC. FOXM1 occupies an important position in the gene network and is a key regulatory gene for breast cancer.
10 Complexity ere are many key genes that can be regulated, and only some are listed in Table 5. Other regulated genes can be determined according to the ranking of PR values in Table 1 to Table 4.

Conclusion
In this paper, by establishing a complex gene network, statistical analysis is carried out through gene nodes, and then PR values are calculated for each gene node and ranked statistically to obtain gene key nodes. Under different thresholds, the PR value changes of different gene nodes are taken, and the appropriate gene network is selected for screening. e method proposed in this paper can transform the expression amount of gene expression network into a complex gene network and then into PR value, thus obtaining key genes. e next step of research work is to identify key genes from RNA-Seq analysis of expression in the second-generation sequencing and to identify key genes in combination with SNP, InDel, and other variants. It can also consider using other methods [24,25] to solve the key genes.
Data Availability e raw data supporting the conclusions of this article will be made available by the authors, without undue reservation.

Conflicts of Interest
e authors declared that they have no conflicts of interest regarding this work.