Based on biomedical index data

Abstract To explore the factors influencing the occurrence of prostate cancer, establish a risk prediction model, and provide a reference for clinicians' preliminary diagnosis, this study retrieved data on patients with prostate cancer and patients with prostate hyperplasia from the National Clinical Medical Science Data Center. Using Stata SE 12.0 and SPSS 25.0, bias between groups was balanced by propensity score matching (PSM). Based on the matched data, the relevant factors were further screened by stepwise logistic regression analysis, and a regression model on the key variables and an artificial neural network model were established. The prediction accuracy of the models was evaluated by combining the predicted probabilities on the test set with the area under the receiver operating characteristic (ROC) curve. After 1:2 PSM, 339 pairs were matched successfully. There were 159 cases in the testing group and 407 cases in the training group. The regression model was P = 1 / (1 + e^(0.122 × age + 0.083 × apolipoprotein C3 + 0.371 × total prostate specific antigen [tPSA] − 0.227 × apolipoprotein C2 − 6.093 × free calcium [iCa] + 0.428 × apolipoprotein E − 1.246 × triglyceride − 1.919 × HDL cholesterol + 0.083 × creatine kinase isoenzyme [CKMB])). Both the logistic regression model (ROC, 0.963; 95% confidence interval, 0.951 to 0.978) and the artificial neural network model (ROC, 0.983; 95% confidence interval, 0.964 to 0.997) performed very well. A high blood level of apolipoprotein E (Apo E) (odds ratio [OR], 1.535) was a risk factor, and high triglyceride (TG) (OR, 0.288) was a protective factor. The model takes the biochemical examination results of each case as variables, can initially reflect the risk of prostate cancer, and provides some reference for diagnosis and treatment.


Background
Prostate cancer (PCa) is a common malignant tumor of the genitourinary system in elderly men, and its incidence shows obvious ethnic and regional differences. Prostate cancer ranked first in the number of new cancer cases in the United States in 2019, and prostate cancer deaths accounted for 10% of all cancer deaths. [1] Prostate cancer screening has always been a controversial health topic. Following the recommendations of the American Urological Association, prostate specific antigen (PSA) screening was widely carried out in the mid-1990s. [2] However, in recent years, many national guidelines have been updated and revised in accordance with the recommendations of the U.S. Preventive Services Task Force (USPSTF), which recommended against PSA screening and pointed out that patients may be overdiagnosed and overtreated. [3] To date, PSA has not shown the ability to discriminate clinically important cancers from low-risk tumors. The Prostate Cancer Intervention versus Observation Trial (PIVOT) showed no survival benefit from radical prostatectomy in men with PSA < 10 μg/L, [4] a threshold based on the D'Amico criteria, which combine PSA < 10 μg/L, stage T2a, and Gleason score 6. It was also reported that more than half of these men underwent unnecessary treatments in Australia. [5] Moreover, PSA screening as currently practiced in the United States provides little to no reduction in prostate cancer mortality or morbidity, does not decrease any-cause mortality, and results in substantial diagnostic and treatment harms and large health-care expenditures. The health importance of prostate cancer and the financial costs to society require improved detection and treatment strategies and more rational use of current options. Until then, men and their health-care providers can make a wise health-care choice by saying no to the PSA test. [6]
Prostate cancer screening in China is still in its infancy, but the incidence is increasing rapidly year by year. Some prediction models have been established; Zhu et al tested these models and found that they overestimated risk by approximately 20% across a wide range of predicted probabilities. [7] Therefore, constructing a prostate cancer risk prediction model for the Chinese population is both effective and necessary. There are also many reports on prostate cancer risk prediction in China, but the main factors considered are diagnosis and treatment methods, pathological grading, magnetic resonance imaging, fluorescent probes, etc. [8,9]

Objectives
Most existing models do not take into account common clinical indicators such as biochemical examinations. Because these results are conveniently obtained from blood tests, a model built on them could greatly increase the efficiency of prostate cancer screening. This may be the first study to use biochemical indexes to build a prediction model for prostate cancer. In addition, the covariates of most models affect model accuracy to varying degrees. Through propensity score matching, this study is expected to better match the case group with the control group and thereby make the covariates more balanced. The model can be used to identify Chinese men who are at high risk of developing prostate cancer, as well as for cancer screening and the development of preventive health strategies.
The artificial neural network was first proposed by David and James of San Diego State University. In theory, such a network can approximate any continuous function on a closed interval: during training, it constantly adjusts the connection strengths between neurons so that the difference between the output (dependent variable) vector calculated by the network and the dependent variable vector of the known training samples is minimized (that is, the prediction effect is the best).

Source of data & sample size
The study data were collected from the PSA and biochemical examination records of 3000 patients in the National Clinical Medical Science Data Center (301 Hospital), including 2771 patients with prostate hyperplasia, 112 patients with both prostate hyperplasia and prostate cancer, and 117 patients with prostate cancer. The 112 patients with both prostate hyperplasia and prostate cancer were regarded as prostate cancer patients and, together with the 117 prostate cancer patients, formed the experimental group. The 2771 patients with prostate hyperplasia were defined as the control group.

Participants & missing data
The data were sorted using Excel 2010. We selected biomedical index variables for which missing values accounted for less than 5% of the original data. [10] Missing values in a small number of variables were handled in SPSS version 25.0 (SPSS Inc., Chicago, IL) with the MCAR method, [11] and relevant variables were extracted from the diagnostic information of PSA and biochemical tests.

Predictors
Propensity score matching was performed in Stata [12] version 12.0 SE (Stata, College Station, Texas 77845, USA). From the data screened in SPSS, the 2771 non-prostate cancer patients were entered into the Stata database as the control group and the 229 prostate cancer patients as the experimental group. With the presence of prostate cancer as the grouping factor, the remaining variables (age, weight, body mass index, apolipoprotein A1 [Apo A1], apolipoprotein A2 [Apo A2], apolipoprotein B2 [Apo B2], apolipoprotein C2 [Apo C2], apolipoprotein C3 [Apo C3], apolipoprotein E [Apo E], albumin, alkaline phosphatase, lactate dehydrogenase, creatine kinase, creatine kinase isoenzyme [CKMB], triglyceride, high-density lipoprotein cholesterol, low-density lipoprotein cholesterol, sodium, calcium, inorganic phosphorus, free calcium [iCa], potassium, chlorine, creatinine [Cre], total prostate specific antigen [tPSA], and free prostate specific antigen) were included in the propensity score (PS) model as covariates. The caliper value was set to 0.05 according to previous research, and 1:2 matching was used. The matching results were then validated with an equilibrium test. If the bias was negative, the test was repeated until the matched data fit well. [13]
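The matching step can be illustrated with a minimal Python sketch. It is not the Stata procedure used in this study: it assumes that propensity scores have already been estimated, that the 0.05 caliper is an absolute difference on the propensity score scale, and that matching is greedy and without replacement. The function name, the patient identifiers, and the scores are hypothetical.

```python
from typing import Dict, List


def match_within_caliper(
    case_scores: Dict[str, float],
    control_scores: Dict[str, float],
    ratio: int = 2,
    caliper: float = 0.05,
) -> Dict[str, List[str]]:
    """Greedy nearest-neighbour matching on the propensity score.

    Each case receives up to `ratio` controls whose scores lie within
    `caliper` of its own score. Controls are used at most once
    (matching without replacement, an assumption of this sketch).
    """
    available = dict(control_scores)  # controls not yet matched
    matches: Dict[str, List[str]] = {}
    for case_id, case_score in case_scores.items():
        # Controls inside the caliper, closest first.
        in_range = sorted(
            (abs(case_score - score), control_id)
            for control_id, score in available.items()
            if abs(case_score - score) <= caliper
        )
        chosen = [control_id for _, control_id in in_range[:ratio]]
        if chosen:
            matches[case_id] = chosen
            for control_id in chosen:
                del available[control_id]
    return matches


# Hypothetical propensity scores for two cases and four controls.
cases = {"case_1": 0.30, "case_2": 0.80}
controls = {"ctrl_1": 0.31, "ctrl_2": 0.27, "ctrl_3": 0.50, "ctrl_4": 0.84}
matched = match_within_caliper(cases, controls)
```

In this sketch a case with fewer than two controls inside the caliper keeps the controls it has, and a case with none is dropped; other conventions are possible.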

Ethical statement
This study was approved by the Institutional Review Board of the Chinese PLA General Hospital. Participants' informed consent was waived by the institutional review board because this study involved routinely collected medical data that were anonymously managed in all stages, including the stages of data cleaning and statistical analyses.

Statistical analysis methods
Multivariable logistic regression analysis and artificial neural network.
In the matched database, non-prostate cancer patients were set as the control group and prostate cancer patients as the case group. Cases with prostate cancer were assigned a value of 1 and cases without prostate cancer a value of 0. Based on the data, the literature, [14][15][16] expert experience, and clinical knowledge, the following patient information was screened: age, weight, body mass index, Apo A1, Apo A2, Apo B2, Apo C2, Apo C3, Apo E, serum albumin, alkaline phosphatase, lactate dehydrogenase, CKMB, triglyceride, high-density lipoprotein cholesterol, and low-density lipoprotein cholesterol. These indicators were compared between the 2 types of patients. Factors related to prostate cancer were identified using univariable logistic regression analysis. Related factors (P < .05) were then included in multivariable stepwise logistic regression. Meanwhile, the data with the filtered variables were also entered into the artificial neural network for training and testing. After analysis, meaningful correlation factors were identified, and their degree of impact on prostate cancer was judged on the basis of significant differences.
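How a fitted logistic model turns a patient's values into a predicted probability can be sketched as follows, using the coefficients reported in the abstract. This is a hedged illustration, not the authors' implementation: it assumes the conventional form P = 1 / (1 + e^(-z)), so that positive coefficients correspond to OR > 1, and it assumes no intercept because none is reported. The variable names, units, and patient values are hypothetical, so the absolute probabilities are illustrative only.

```python
import math
from typing import Dict

# Coefficients as reported in the abstract; the key names are hypothetical.
COEFFICIENTS: Dict[str, float] = {
    "age": 0.122,
    "apo_c3": 0.083,
    "tpsa": 0.371,
    "apo_c2": -0.227,
    "ica": -6.093,
    "apo_e": 0.428,
    "triglyceride": -1.246,
    "hdl_c": -1.919,
    "ckmb": 0.083,
}


def predicted_probability(patient: Dict[str, float]) -> float:
    """Logistic probability for one patient (assumed form, no intercept)."""
    z = sum(coef * patient[name] for name, coef in COEFFICIENTS.items())
    # Numerically stable evaluation of 1 / (1 + exp(-z)).
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    return math.exp(z) / (1.0 + math.exp(z))


# Hypothetical patient; the values are placeholders, not study data.
patient = {
    "age": 70, "apo_c3": 10, "tpsa": 4, "apo_c2": 4, "ica": 1.2,
    "apo_e": 4, "triglyceride": 1.5, "hdl_c": 1.2, "ckmb": 10,
}
p_baseline = predicted_probability(patient)
p_high_tpsa = predicted_probability({**patient, "tpsa": 20})
```

Because the tPSA coefficient is positive, raising tPSA while holding the other values fixed raises the predicted probability under this assumed form.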

Participants
After PSM, the results of the first equilibrium test showed that the variables Na, ALP, and LDLC were not well fitted. As shown in Table 1

Model specification
The matched data were entered into univariable logistic regression to screen the variables and select those with statistical significance (P < .05). The analysis results are shown in Supplementary Table 2, http://links.lww.com/MD/G38. For the PSA and calcium parameters, only one of each type was selected. The table shows that the Wald statistic of total PSA is greater than that of free PSA. In previous clinical analyses, [17,18] total PSA always reflects the serum antigen value, so tPSA was included in the model and free PSA was excluded. Based on a number of studies, [18,19] iCa, which has not been included in a prediction model before, was also included in the model. [20] In summary, the variables included in the model at
The new model was compared with the predicted probability of prostate cancer diagnosis based on total PSA alone, and diagnostic efficiency was judged according to the ROC curve (Fig. 1). The prediction efficiency of the new model (area under the curve: 0.963) is markedly higher than that of PSA alone as a single factor (area under the curve: 0.785).
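The area under the ROC curve used for this comparison equals the probability that a randomly chosen case receives a higher score than a randomly chosen control, with ties counted as one half. A minimal sketch of that calculation follows; the function name and the toy labels and scores are hypothetical and are not study data.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve via pairwise comparison of cases and controls."""
    case_scores = [s for y, s in zip(labels, scores) if y == 1]
    control_scores = [s for y, s in zip(labels, scores) if y == 0]
    if not case_scores or not control_scores:
        raise ValueError("both classes must be present")
    wins = 0.0
    for case in case_scores:
        for control in control_scores:
            if case > control:
                wins += 1.0
            elif case == control:
                wins += 0.5  # ties count as half a win
    return wins / (len(case_scores) * len(control_scores))


# Toy example: 1 = prostate cancer, 0 = no prostate cancer.
labels = [1, 1, 0, 0]
auc_model = roc_auc(labels, [0.9, 0.8, 0.3, 0.1])   # perfect separation
auc_single = roc_auc(labels, [0.9, 0.2, 0.3, 0.1])  # one case ranked too low
```

The pairwise form is quadratic in the sample size, which is acceptable for a few hundred matched patients; rank-based formulas give the same value more efficiently.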
Data on the 9 related factors resulting from the univariable logistic regressions were entered into the artificial neural network model for training and testing. The results are shown in Figure 2.
In terms of model specification, artificial neural networks require no knowledge of the data source but, since they often contain many weights that must be estimated, they require large training sets. [21,22] The system randomly selected 71.9% of the cases as the training set for modeling and 28.1% of the cases as the test set to test the quality of the model. [23] There was no significant difference between the verification model and the test model (94.8% and 91.8%, respectively), so there was no overtraining in the model, and the importance of each variable was similar to that in the logistic model. At the same time, the logistic regression model was compared with the artificial neural network model in terms of simulated diagnostic efficiency, and diagnostic efficiency was judged according to the
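The random partition described above can be sketched as follows. This is an assumption-laden illustration rather than the SPSS procedure: it shuffles the case identifiers with a fixed seed and cuts the list at 71.9%, whereas the software may assign cases by another random mechanism. The total of 566 cases is the sum of the 407 training and 159 testing cases given in the abstract; the function name and seed are hypothetical.

```python
import random
from typing import Iterable, List, Tuple


def split_cases(
    case_ids: Iterable[int], train_fraction: float = 0.719, seed: int = 0
) -> Tuple[List[int], List[int]]:
    """Shuffle case identifiers and split them into training and test sets."""
    shuffled = list(case_ids)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_train = round(train_fraction * len(shuffled))
    return shuffled[:n_train], shuffled[n_train:]


# 566 matched cases in total (407 training + 159 testing in the abstract).
train_ids, test_ids = split_cases(range(566))
```

With these inputs the split reproduces the reported group sizes, which is consistent with 71.9% and 28.1% being the rounded shares of 407 and 159 out of 566.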

Discussion
With growing debate about the accuracy of PSA screening for prostate cancer, a large number of researchers have found that PSA screening may lead to overtreatment and overdiagnosis. This concern has driven the development not only of new diagnostic and prognostic tools but also of models to predict the risk of prostate cancer. Based on the fundamental realities of the country, this
Table 2 Multivariable regression analysis. Columns: Step, Relative variable, b, S.E, Wald, P, OR.
The findings of this study will be helpful in deciding on future health policies and preventive strategies for prostate cancer in China. This study is the first to develop a risk prediction model for prostate cancer based on biomedical information. With a diagnostic efficiency of 96.3% for the logistic regression model and 98.3% for the neural network model, our model is an excellent discriminator compared with former models, including those that combine PSA values with PSA-related measures and prostate volume.
Prediction models for prostate cancer have been reported previously. [24,25] Most previous models have concentrated on the PSA test for prostate cancer screening and ignore the fact that the cutoff values for specificity and sensitivity are indistinct. High predictive accuracy and discrimination have been achieved in several models, for instance, Prostaclass I (AUC, 0.79), Chun (AUC, 0.76), Karakiewicz (AUC, 0.74), and Finne (AUC, 0.74). [26] However, they are still limited by a high probability of overdiagnosis and overtreatment. [27] Given the absence of PSA screening in China, multiple additional parameters such as MRI, DRE, and prostate volume were added to increase the predictive accuracy of PSA testing in the developmental prediction model.
This study includes not only PSA but also multiple biomedical parameters that are easily obtained from blood test reports in China. Relevant reports are few, but the risk factors above have been demonstrated in experiments by other researchers [28,29] and were also demonstrated in this experiment; some risk factors that are still controversial were also found in this experiment. Apolipoprotein E is an important cholesterol regulatory protein. The main genetic subtypes of apolipoprotein E in the body are E3/E3, E3/E4, E2/E3, E2/E2, and E4/E4.
At present, studies at home and abroad have shown that the relationship between the invasion and Gleason score of prostate cancer cells and their genotypes in vivo is controversial. In an earlier case-control study, Liu et al [30] indicated that the E4 genotype and its allele were not associated with the pathogenesis and prognosis of prostate cancer, but this could not explain the experiment by Ifere et al, [31] which showed that apolipoprotein E2/E4 is a risk factor for prostate cancer. More recently, Yencilek et al [32] suggested that the presence of E4 may reduce the possibility of prostate cancer, while still holding that E3/E3 is a major risk factor for prostate cancer and affects Gleason scores. Recently, a study by Asare et al [33] showed that Apo E could potentially be a discriminating biomarker for prostate cancer. Our study supports this view.
Stepwise logistic analysis showed that apolipoprotein E was a risk factor for prostate cancer, with an OR of 1.535. This indirectly verifies the relationship between apolipoprotein E and prostate cancer. In clinical work, patients can be guided to undergo genetic testing to support better diagnosis and guide subsequent treatment.
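The odds ratios quoted in this paper follow from the regression coefficients through OR = e^b. The short sketch below checks this relationship for the coefficients given in the abstract against the ORs reported in the text. The dictionary keys are hypothetical labels; age and tPSA are omitted because no ORs are reported for them.

```python
import math

# (coefficient b from the abstract, OR reported in the text)
reported = {
    "apo_e": (0.428, 1.535),
    "apo_c2": (-0.227, 0.797),
    "apo_c3": (0.083, 1.086),
    "triglyceride": (-1.246, 0.288),
    "hdl_c": (-1.919, 0.147),
    "ica": (-6.093, 0.002),
    "ckmb": (0.083, 1.086),
}

# OR = exp(b): a value above 1 marks a risk factor, below 1 a protective factor.
odds_ratios = {name: math.exp(b) for name, (b, _) in reported.items()}

# Largest absolute gap between exp(b) and the reported OR (rounding differences).
max_gap = max(abs(odds_ratios[name] - or_) for name, (_, or_) in reported.items())
```

The recomputed values agree with the reported ORs to within rounding of the three-decimal coefficients, which supports reading each coefficient as a log odds ratio per unit increase in the corresponding variable.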
In addition, no previous experiments appear to have stated that apolipoprotein C2 and apolipoprotein C3 are associated with prostate cancer. This model is probably the first to demonstrate that apolipoprotein C2 is a protective factor and apolipoprotein C3 is a risk factor for prostate cancer, with OR values of 0.797 and 1.086, respectively. We speculate that the internal mechanism may be that tumor patients accelerate the decomposition of endogenous lipids and the transformation and

Triglyceride and high density lipoprotein cholesterol.
TG provides essential fatty acids in lipid metabolism, and its role remains controversial among scholars. Allot et al [34] believe that an increase in serum triglycerides is related to the occurrence of prostate cancer. However, Asare et al [33] found no significant difference in TG between benign prostatic hyperplasia and prostate cancer. High density lipoprotein cholesterol (HDL-C) is an anti-atherosclerotic lipoprotein that transfers cholesterol from extrahepatic tissue to the liver for metabolism. A case-control study conducted by Magura et al [35] showed that high TC (total cholesterol), high LDL-C (low density lipoprotein cholesterol), and low HDL-C may be risk factors for prostate cancer. However, the relationship between blood lipids and prostate cancer is still controversial. In general, no studies in the Chinese population have examined the internal association between TG, HDL-C, and prostate cancer. In the results of this study, triglyceride and high density lipoprotein cholesterol were protective factors for prostate cancer, with OR values of 0.288 and 0.147, respectively, indicating that low triglyceride and low high density lipoprotein cholesterol increase the risk of prostate cancer.

Free calcium.
Calcium is an indispensable ion for maintaining the normal physiological activities of the body, and it is very important for the regulation of electrical activity across the cell membrane. At the same time, calcium intake can affect signal transduction pathways, promote the secretion of vascular endothelial factor, and increase hypoxia inducible factor. In the last century, X-ray microanalysis was performed on freeze-dried cryosections of normal, hyperplastic, and neoplastic human prostate, and studies found that calcium is the major prostate acinar cell cation. [36] In recent years, a number of studies at home and abroad have also shown that calcium-binding proteins can activate a variety of pathways to promote the spread of invasive prostate cancer cells. [37,38] In addition, our experimental results suggest that high iCa may prevent the formation of calcium-binding proteins.
In this experiment, as a protective factor of prostate cancer, the OR value of iCa is 0.002, which is of little statistical significance and has little guiding significance for clinical work, but it has a certain enlightening effect on scientific research.

Creatine kinase isoenzyme.
Creatine kinase isoenzyme (CK-MB) is mainly used in the diagnosis of myocardial infarction. However, in earlier years, Gries et al [39] incidentally found that the level of CK-MB may be related to prostate tumors. Since then, with the continuous development of proteomics, many scholars [40,41] have suggested that CK-MB should be included in clinical screening as a marker of malignant tumors. As of 2015, there was no systematic review or clinical application report on the false increase in CK activity caused by other CK-MB isozymes in malignant tumors. [42] In this study, CKMB was a risk factor for prostate cancer with an OR of 1.086; that is, an increase in CKMB increases the risk of prostate cancer. This suggests that researchers should study and develop new indicators based on CK-MB, and it provides evidence for previous experiments.

Implications
Clinically, researchers should also pay attention to patients' cardiovascular disease and distinguish it from prostate cancer in a timely manner. Furthermore, the latest research indicates that hsa-miR-940 acts as a diagnostic and prognostic tool for prostate cancer. [43] In addition, ix co-expressed miRNAs (hsa-miR-17-3p, -377-3p, -410-3p and -495) and p2 miRNA panel (hsa-miR -377-3p, -410-3p, -27a-3p, 149-5p and 940) are mainly associated with prostate cancer. [44] In other respects, Raman-enhanced spectroscopy (RESpect) is a noninvasive, label-free, laser-based technique that identifies the molecular composition of tissues and cells; experiments have demonstrated that this technique could provide insight into different pathways leading to precancerous anal squamous intraepithelial lesions. [45] It is believed that it can also be extended to several carcinomas, including prostate cancer. Meanwhile, previous studies indicated that in bone scan-negative patients with a relatively high PSA level and velocity, the risk of distant disease is much greater, and PET imaging [46] may serve as a useful whole-body staging method. [47] More tracers for PET/CT have now been shown to be more accurate in the detection of recurrent disease than radiolabelled choline PET/CT. [48] It is exciting for clinicians that the efficiency of diagnostic tools may improve in the future. At the level of genomics, researchers have suggested that variations in the tumor epigenetic landscape of individuals are partly mediated by genetic differences, which may affect prostate cancer progression. [49] This suggests that these results could be applied in clinical practice to help distinguish indolent prostate cancer from advanced disease.

Conclusion
In this study, the use of propensity score matching reduces the differences between groups in the data and allows a better comparison between groups. In addition, this experiment introduces a neural network model to improve the adaptability of the model to nonlinear relationships between complex variables. Both the logistic regression model (ROC, 0.963; 95% confidence interval, 0.951-0.978) and the artificial neural network model (ROC, 0.983; 95% confidence interval, 0.964-0.997) performed very well. Most importantly, apolipoprotein E, apolipoprotein C2, apolipoprotein C3, triglyceride, high density lipoprotein cholesterol, iCa, and CKMB are factors related to prostate cancer that either have never been reported or remain disputed; these findings increase confidence in the known evidence or point out directions for future research. Moreover, adding apolipoprotein tests to the routine physical examination may be a better way to improve the accuracy of prostate cancer screening.

Limitation
This study also has many shortcomings: the final model does not include pathological diagnosis, MRI imaging, Gleason score, digital rectal examination, or other strong pathological factors as risk factors, since all data were collected through routine physical examinations; the period between risk measurement and identification of prostate cancer was short; and additional unmeasured or unexamined variables were excluded. Besides, the number of patients needs to be increased, and a multicenter study should be conducted. In addition, there are not many related research reports in China, experiments based on a single factor are not convincing, and more experts and scholars are needed to provide external medical record data to verify the advantages and disadvantages of the model. Propensity score matching cannot assess and balance all possible outcome-influencing factors, [50] such as LDL; several studies have indicated pathways that are activated by LDL, [51] but LDL was excluded from this model. Furthermore, the applicability of the model in clinical practice needs to be tested through a large number of validations. [52] Despite these limitations, this is the first significant study of clinical prediction modeling to assess the incidence risk of prostate cancer using biomedical parameters in China. Besides, the new parameters identified in this study also inspire us to explore the inner connections and molecular functions linking the biomedical indexes and prostate cancer, in order to better apply this model and related research to China's domestic clinical work. To date, many risk calculators have been established, [53] and we hope that more and more scholars will work in this area.