An Early Intestinal Cancer Prediction Algorithm Based on Deep Belief Network

The incidence of colorectal cancer (CRC) in China has increased in recent years, and its mortality rate has become one of the highest among all cancers. CRC also increasingly affects people’s health and quality of life, and the workloads of medical doctors have further increased due to the lack of sufficient medical resources in China. The goal of this study was to construct an automated expert system using a deep learning technique to predict the probability of early stage CRC based on the patient’s case report and the patient’s attributes. Compared with previous prediction methods, which are either based on sophisticated examinations or have high computational complexity, this method is shown to provide valuable information such as suggesting potentially important early signs to assist in early diagnosis, early treatment and prevention of CRC, hence helping medical doctors reduce the workloads of endoscopies and other treatments.

exploring the mechanism behind the disease, in predicting and evaluating corresponding treatments, and finding new drug targets, thereby opening up new avenues for drug research and development 12 .
The medical industry has incorporated high tech solutions such as artificial intelligence and sensing technologies, making medical services increasingly intelligent. The recent policy of "New Healthcare Reform" in China has made intelligent healthcare accessible to ordinary people. Intelligent healthcare aims to capitalize on artificial intelligence technology to assist in various types of medical decision making, including disease risk prediction, intelligent healthcare consultation, medical image analysis, electronic medical record information extraction, medical health data analysis, medical insurance evaluation, and making recommendations for medication. In 2017, Esteva et al. developed a deep neural network that can successfully classify skin cancer from sample data 13 , demonstrating that deep learning methods have great potential for use in medical fields. Intelligent systems that can make early disease predictions or help provide information for doctors during the diagnosis process are valuable in both scientific research and clinical medicine.
In recent years, many research teams have attempted to pursue machine learning methods to classify cancer patients as high or low risk. These technologies can play important roles in research and treatment of cancer diseases 14 . The purpose of machine learning methods is to detect key features from complex sample data and to reveal their contributions. Machine learning methods such as artificial neural networks, Bayesian networks, support vector machines (SVM), and decision trees have been widely used in cancer research and provide effective and accurate basic models for early prediction of various types of cancers.
The dimensions of the sample data increase with the number of examination data items during the early diagnosis of cancer. However, because the specific examination items collected vary on a case-by-case basis, it is natural to see data sparseness in the constructed sample dataset. Consequently, the noise in the data also increases, which inevitably negatively impacts the performances of early CRC prediction algorithms. In addition, because of the high dimensionality of the sample data, the time complexity of traditional prediction algorithms is usually high. Therefore, we intend to devise a method to effectively address both data sparsity and high dimensionality and to eliminate noise in prediction problems, allowing us to learn which sample features play key roles in early CRC prediction.
Wang et al. defined the problem of feature selection in intelligent healthcare as a combinatorial optimization or search problem, rather than relying on the commonly used filter, wrapper and embedded feature selection methods 15 . They applied several feature selection methods, including exhaustive search, heuristic search and hybrid methods. The heuristic search methods include feature ordering metrics either with or without data extraction. Kleftogiannis et al. combined an SVM with a genetic algorithm (GA) to perform feature selection and parameter optimization 16 . Duan proposed a backward elimination feature extraction method similar to the SVM recursive feature elimination method (SVM-RFE) 17 . At each step, the method computes the feature ranking scores by statistically analyzing the weight vectors of multiple linear SVMs trained on subsamples of the original training data. Zhong et al. used an SVM to analyze protein characteristics based on the Pearson correlation coefficient to eliminate redundant features 18 . Fong et al. combined the particle swarm optimization algorithm with three different classification methods (pattern network, decision tree and naive Bayes) to search for the optimal feature subset 19 . The results show that the method achieves high classification precision on specific datasets. Inspired by evolutionary algorithms, Mohapatra et al. proposed a modified cat swarm optimization (MCSO) algorithm to extract features from datasets, applied it to several biomedical datasets, and achieved favorable results 20 . Metsis et al. proposed a feature extraction method based on a structural sparse induction specification and compared it with existing feature extraction methods on four published aCGH datasets 21 . Boreto et al. proposed an analytical geometric feature extraction method, supervised variational relevance learning (Suvrel), which uses a variational method to determine the metric tensor that defines the distance-based similarity during pattern classification 22 . The variational method was applied to a cost function that penalizes large intraclass distances and favors small interclass distances. Their approach yields a metric tensor that minimizes the cost function. Bennasar et al. introduced the joint mutual information maximization (JMIM) and the normalized joint mutual information maximization (NJMIM) methods, both of which use mutual information and the "maximum of the minimum" criterion, thus alleviating the overestimation of feature significance that has been observed both theoretically and experimentally 23 . Xu et al. used the minimum redundancy maximum relevance (mRMR) metric, forward feature extraction and an SVM, and found that this combination outperformed other classifiers such as Bayesian decision theory, K nearest neighbor and random forest 24 .
In addition, to address the sparsity and noise of the data in such problems, the matrix decomposition technique is a commonly used method at present; its implementation is relatively simple and its prediction accuracy is relatively high. The most famous matrix decomposition methods include singular value decomposition (SVD) 25,26 , principal component analysis (PCA) 27 , independent component analysis (ICA) 28 , and others. Among these, SVD requires completing the data to avoid the sample sparseness problem; however, this operation not only increases the required data storage space but also potentially violates the practical significance of the sample data in a specific environment. Meanwhile, because SVD is a highly complex algorithm, it is not applicable to networks with large sample sizes. Therefore, based on SVD, Simon Funk proposed the LFM model by optimizing the diagonal array of the eigenvalues of the sample data matrix into a decomposed matrix by optimizing the evaluation index RMSE in the training matrix 29 . In real prediction systems, no uniform standard exists for each new data sample; therefore, Koren added the user's historical scores based on LFM and proposed the SVD++ model 30 .
However, the above series of feature extraction models do not consider the existence of negative values in the sample data. In a prediction system, negative values in the sample matrix have no practical meaning in a real situation. For example, during early cancer diagnosis, a certain patient attribute or a certain indicator with a negative value may be meaningless when reconstructing the sample data. Therefore, Lee and Seung proposed a nonnegative matrix factorization method (NMF) 31,32 , which finds the low rank of the matrix and then decomposes it into a nonnegative matrix. This method not only greatly reduces the dimensionality of the matrix but also removes redundant data, making the decomposed result more interpretable in practice. NMF technology has been widely applied in the health care 33 , medical imaging [34][35][36] and biomedical fields 37,38 ; however, this technology has not attracted widespread attention in early cancer prediction. Therefore, this paper integrates NMF and combines it with a deep learning method to facilitate early CRC detection.
Multiple examples of deep learning applications exist in medical research, most of which focus on automatically identifying tumor images or detecting gene sequences, and these algorithms have achieved good results. Xiao et al. developed a deep learning-based 5-class model to make cancer predictions using RNA sequence data 39 . Danaee et al. used a deep learning approach (a stacked denoising autoencoder) to analyze gene expression data and identify genes potentially correlated with breast cancer 40 . Some researchers have applied deep learning techniques to analyze cancer imagery. Bychkov et al. proposed a deep learning method to analyze CRC images, and their results showed that state-of-the-art deep learning techniques are able to extract more prognostic information from the tissue morphology of CRC than can an experienced medical professional 41 . However, in real conditions, especially those in developing countries, examination data such as tumor imagery and genetic testing data are not easily obtained. Given the constraints on patients' economic and medical conditions, numerous patients do not have access to these techniques. In addition, test procedures such as tumor imaging and genetic testing are typically performed only for patients already strongly suspected of having cancer. Therefore, during the most important period (i.e., the prevention and early diagnosis period), these data provide minimal help. In this paper, we attempt to use the simplest and most commonly available test data (the medical examination report) to create a new prediction system to help doctors make decisions. The medical examination report is a basic test that almost every patient undergoes; thus, our early cancer prediction system can be applied to a broader range of patients.
CRC is a multifactor disease. In CRC prediction, combining data such as age, gender, family history of CRC, BMI, past history and other attributes and patient case reports using deep learning techniques in an expert system to predict the likelihood of early cancer will greatly reduce missed diagnoses by clinicians during endoscopy and treatment and will also provide effective help for early diagnosis, early treatment and prevention of CRC.
This paper explores and analyzes patient data from a deep learning perspective combined with patient attributes and case reports to construct an expert system to predict the probability of early cancer. Due to its relatively effective dimensional reduction and noise cancellation techniques, this method shows great promise for application in real scenarios. By greatly reducing missed clinician diagnoses during endoscopy and treatment, it will provide effective help for the early diagnosis, early treatment and prevention of CRC.

Results
The sample dataset includes each sample's attributes (e.g., age, gender, smoking history, and drinking history), endoscopic features (e.g., lesion location, polyp size, and no leaf) and blood attributes (e.g., white blood cells and hemoglobin). There are 50 features in all categories.
We compare early cancer prediction (ECP) with four classic machine learning algorithms, i.e., an SVM, K nearest neighbor (KNN), ensembles for boosting (EB), and random forest (RF), and three deep learning methods, i.e., a convolutional neural network (CNN), a recurrent neural network (RNN1), and a recursive neural network (RNN2). Each method's performance is averaged over 100 runs in which the data are randomly separated into a training set (containing 90% of the samples) and a test set (containing 10% of the samples). Normally, precision and recall are not necessarily related; however, in large-scale datasets, these two indicators are correlated. A false negative (FN) means that the prediction model incorrectly predicted a sample from the positive category as a negative category. Specifically, in this experiment, an FN means that a sample from a cancer patient was classified as being from a noncancer patient. In the clinic, the false negative rate (FNR) is important because a false negative may lead to a missed diagnosis. Therefore, in this paper, we mainly use the F1_Score and FNR as the evaluation metrics of the algorithms. The experimental results are as follows. From Table 1, we can see that our ECP algorithm achieves the highest F1_Score on the real sample dataset. Both the precision and recall of our method exceed those of the other algorithms. In addition, its FNR is the smallest among all algorithms. After dimensionality reduction by NMF, we reduced the original 50-dimensional matrix to 14 dimensions and extracted the hidden features. This idea facilitates effective early diagnosis, early treatment and prevention of cancer. Therefore, our algorithm not only reduces the spatial complexity of the sample but also achieves better prediction results. False negatives can also be caused by instability in the patient's condition, and related data may be collected during the window period of other diseases, resulting in data noise.
Next, we analyze the multidimensional features of the original dataset. In this paper, we input m attributes and n samples, where $X_{ij}$ corresponds to the value of the j-th attribute of the i-th sample. Here, k is a hypothetical number of important features in the NMF, which is generally less than the number of attributes. After NMF decomposition, $W_{ik}$ corresponds to the correlation probability of the i-th sample and the k-th important feature, and $H_{kj}$ corresponds to the probabilistic correlation of the j-th attribute and the k-th important feature. The result of the NMF is shown in Fig. 1.
(2019) 9:17418 | https://doi.org/10.1038/s41598-019-54031-2 | www.nature.com/scientificreports
We can see from Fig. 1 that after the nonnegative matrix decomposition, the dimensionally reduced W matrix retains the content of the original X matrix. Finally, we construct a heat map of the H matrix from the nonnegative matrix decomposition, relating the attributes to the k important features. We use the green block diagram to identify the most important attributes and features among all 50 attributes and the 14 extracted important features, as shown in Fig. 2. As Fig. 2 shows, factors such as gender, smoking history, drinking history, hypertension, diabetes, whether early cancer is present, whether multiple cancers are present, whether lobes are used, and whether thermal biopsy forceps are used all have a greater impact on the characteristics of the extracted features after dimensionality reduction. For example, in patients with early stage cancer, the polyps are relatively large; thus, they are easily detected by thermal biopsy forceps. The use of thermal biopsy forceps is correlated with the detection of early cancer.
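The way the H matrix is read in the heat map can be illustrated with a minimal sketch: for each extracted feature (a row of H), the attributes with the largest weights are listed. The attribute names and the numbers in `H` below are made up for illustration only and are not values from the study; `top_attributes` is a hypothetical helper.

```python
def top_attributes(H, names, top=3):
    """For each row of H (one extracted feature), return the names of the
    `top` attributes with the largest weights, in decreasing order."""
    ranked = []
    for row in H:
        order = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        ranked.append([names[j] for j in order[:top]])
    return ranked

# Illustrative toy values only (k = 2 features, m = 4 attributes).
names = ["gender", "smoking history", "hypertension", "hemoglobin"]
H = [
    [0.9, 0.1, 0.0, 0.3],
    [0.0, 0.7, 0.8, 0.1],
]
result = top_attributes(H, names, top=2)
print(result)
```

In this toy case the first feature is dominated by "gender" and the second by "hypertension", which is the kind of attribute-to-feature association the heat map is meant to expose.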
To further compare the computational efficiency of these methods, the processing speed of each method was recorded and is shown in Fig. 3. By averaging the runtime of the training and testing procedures over 10 realizations, we find that our proposed ECP method has a medium runtime compared with the other deep learning methods. The RNN1 and RNN2 methods were less efficient than the other methods, especially in terms of testing runtime.
In summary, this model can help to improve the efficiency of early cancer diagnosis. In contrast to conventional deep learning techniques that focus on image processing, which is usually highly time consuming, our algorithm uses a deep learning technique to analyze patient case reports. This approach not only reduces the spatial complexity of the sample but also achieves better prediction results. In addition, our model suggests that several items in the examinations, such as "smoking history", "drinking history", "hypertension", and "diabetes", are highly correlated with the occurrence of cancer.

Discussion
Case source. The data are a collection of clinical patient records with intraepithelial neoplasia revealed by total colonoscopies performed at the endoscopy center of the First Affiliated Hospital of Nanjing University of Traditional Chinese Medicine (Jiangsu Provincial Hospital of Traditional Chinese Medicine) from February 2014 to February 2016. All the patients provided informed consent as follows: Before the study, the purported benefits and risks of the study, the endoscopic minimally invasive treatment method, its effectiveness, safety, and so on were explained to the patient, and if necessary, to family members; then, the patient or family signed a surgical consent form along with the informed consent form, and the hospital and patient each hold one copy of the forms. For hospitalized patients, doctors have the relevant healthcare records. The observations contained in these records are as follows: (1) patient name, gender, date of birth, birth place, contact information, contact address, height, weight, past history, and family history; (2) number of adenomas, lesions, size, shape classification, glandular opening pit pattern classification, lobulation, treatment, postoperative pathology, etc. To design the algorithm, we use the following data structure to store the sample data.
We define $A = \{u_i, e_j, x_{ij}\}$ as the patient's sample data, where $u_i$ is patient i from the samples, $e_j$ is attribute j in the sample, and $x_{ij}$ is the value of attribute j from sample i. Assuming that the sample data include n patients and m attributes, the sample data constitute an $n \times m$ matrix $X = [x_{ij}]$. Figure 4 shows an example of a sample dataset.
The early cancer prediction method attempts to assign a tag $y_i \in \{0, 1\}$ to each new sample vector $x_i = \{x_{i1}, x_{i2}, \dots, x_{im}\}$ to be predicted. For the case sample, a 1 indicates that the prediction is early cancer, while a 0 indicates cases not predicted as early cancer. To test the accuracy of the algorithm, a sample dataset with known tags must be divided into a training set and a test set. Only the information in the training set is allowed to be used when calculating the labels for the predicted samples. Obviously, $X_{train} \cup X_{test} = X$ and $X_{train} \cap X_{test} = \emptyset$. Each of the experimental results is averaged over 100 runs with randomly divided data where 10% of the entire dataset is used as a test set, and the other 90% of the data is used as a training set.
Algorithm evaluation. After designing the prediction algorithm, we need to evaluate its outcome.
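Currently, the commonly used indicators for measuring the accuracy of such algorithms are accuracy, precision, recall, F1_Score and FNR. We used a 2 × 2 confusion matrix to describe the four possible prediction outcomes:
a. A true positive (TP) means that the predictive model correctly predicted a positive category sample as a positive category.
b. A false positive (FP) means that the predictive model incorrectly predicted a negative category sample as a positive category.
c. A false negative (FN) means that the predictive model incorrectly predicted a positive category sample as a negative category.
d. A true negative (TN) means that the predictive model correctly predicted a negative category sample as a negative category.
Based on these outcomes, the metrics are defined as follows:
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad \mathrm{Precision} = \frac{TP}{TP + FP},$$
$$\mathrm{Recall} = \frac{TP}{TP + FN}, \qquad \mathrm{FNR} = \frac{FN}{FN + TP}.$$
F1_Score combines the results of precision and recall; it is the harmonic mean of precision and recall. When the F1_Score is high, the test method can be regarded as effective. The F1_Score is defined as follows:
$$\mathrm{F1\_Score} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}.$$
Table 2 describes the model predictions for all four possible outcomes as an example. Here, the accuracy is 0.91, which means that 91% of the predictions (91 out of 100 samples) are correct. This might seem to be a good result; however, only one of the nine early cancer samples was correctly identified as cancer. This result is not satisfactory because 8 of the 9 cancer cases were not diagnosed correctly. Therefore, when we use an imbalanced dataset (where a significant difference exists between the number of positive and negative category labels), accuracy alone does not reflect the true situation.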
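The accuracy pitfall in this example can be checked numerically. The sketch below assumes the confusion counts implied by the text: 100 samples, 9 early cancer cases of which 1 is detected (TP = 1, FN = 8), an accuracy of 0.91, and a precision of 0.5, which together give FP = 1 and TN = 90. These counts are inferred from the reported values rather than copied from Table 2, and `confusion_metrics` is a hypothetical helper name.

```python
def confusion_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, F1 and FNR from confusion counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    fnr = fn / (fn + tp) if (fn + tp) else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "fnr": fnr,
    }

# Counts inferred from the toy example in the text (an assumption).
metrics = confusion_metrics(tp=1, fp=1, fn=8, tn=90)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

With these counts the accuracy is high while the recall and F1_Score are low and the FNR is close to 0.89, which is why accuracy alone is misleading on an imbalanced dataset.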
Precision is the fraction of samples predicted as the positive category that actually belong to the positive category. In this example, the precision equals 0.5; meanwhile, the recall equals 0.11, the F1_Score equals 0.18, and the FNR equals 0.89. The latter two indicators show that the toy model used above performs rather poorly. Therefore, we can see that the F1_Score and FNR metrics can be used to effectively evaluate the prediction model when the data samples are not balanced.

The optimal choice of dimension. In the experiments, k is the dimension of the matrix attribute after dimensionality reduction using nonnegative matrices, that is, the number of important features to be extracted. Because the dimension of the original dataset matrix is 50, we gradually increase the dimension (k) of the nonnegative matrix after dimensionality reduction from 1 to 50. We find that the algorithm achieves its best performance when k = 14. We also calculate how two evaluation metrics (precision and recall) vary as k changes from 1 to 50. The results are shown in Fig. 5.

The advantages of ECP. In contrast to the other algorithms, negative values are not considered in our dimensionality reduction because a negative value in the sample data matrix has no real-world meaning in early cancer prediction. For example, during prediction, if a negative value appears in the sample, the corresponding characteristics of the sample data will never be selected during the feature extraction process. However, excluding them may not be correct because the feature may become significant and play a key role in the future. Our model not only reduces the dimensionality of the matrix but also removes redundant data, making the decomposed result more interpretable in practice.
In addition, because our sample data are small, an SVM can easily find a linear relationship between the data and the features for small and medium sample sizes, thereby avoiding the use of a neural network structure and its attendant local minimal value problems. The method is highly interpretable and can be used to solve high-dimensional problems. In addition, the algorithmic time complexity of linear SVM is significantly lower.
Methods
Ethics approval and consent to participate. The present study was approved by The Ethics Committee of the Affiliated Huaian Hospital of Xuzhou Medical University. All patients provided written informed consent before participating, and all the methods were conducted in accordance with the relevant guidelines and regulations.

Deep learning framework of early cancer prediction algorithm based on nonnegative matrix.
Based on the iterative method for NMF computing, we present an algorithm for early cancer prediction based on NMF, named ECP. The framework of our algorithm is shown below.

Table 2. An example used to describe different metrics; the actual values include 100 samples of cancer (positive category) or noncancer (negative category).

Detailed algorithm steps.
Data standardization. In the early cancer prediction algorithm, we need to process multidimensional patient sample data. First, we need to standardize the sample data. Data standardization operates on the columns of the feature matrix. The Z-score standardization method, which standardizes the attributes of each dimension of the sample, is widely used in many deep learning algorithms. This method uses the mean and standard deviation of the data to standardize the data so that the processed data have a mean of 0 and a standard deviation of 1. After standardization, the errors caused by the different scales of the attributes cancel out. Standardization is a linear transformation that rescales each attribute in the sample data proportionally. Data standardization can improve the performance of the algorithm without changing the numerical ordering of the original data. The specific standardization function is as follows:
$$x'_{ij} = \frac{x_{ij} - \mu_j}{\sigma_j},$$
where $\mu_j$ is the mean of the attribute data in column j of the sample matrix, and $\sigma_j$ is the standard deviation of the attribute data in column j.
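The column-wise Z-score standardization described above can be sketched as follows. This is a minimal illustration, assuming the population standard deviation and mapping constant columns to 0; the source does not state either choice. The toy values and the helper name `zscore_columns` are hypothetical.

```python
from statistics import mean, pstdev

def zscore_columns(rows):
    """Standardize each column of a row-major matrix to mean 0 and std 1.

    Assumption: population standard deviation; constant columns map to 0.
    """
    columns = list(zip(*rows))
    mus = [mean(col) for col in columns]
    sigmas = [pstdev(col) for col in columns]
    return [
        [(x - mu) / sigma if sigma > 0 else 0.0
         for x, mu, sigma in zip(row, mus, sigmas)]
        for row in rows
    ]

# Toy data: three samples, two attributes on different scales.
sample = [[50.0, 4.5], [60.0, 5.5], [70.0, 6.5]]
standardized = zscore_columns(sample)
for row in standardized:
    print([round(v, 3) for v in row])
```

After the transformation both columns have the same scale, and the ordering of values within each column is unchanged.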
Eigenvalue extraction. To address the high dimensionality and redundancy characteristic of the sample data, we need to effectively reduce the dimensionality of the original network's sample matrix to remove redundant attributes. For example, certain factors (such as name, gender and age) exist in the dataset that we can reasonably believe would not provide a positive contribution to the prediction algorithm model. Therefore, we can use a method to remove these redundant attributes and improve the final accuracy of the prediction algorithm. Although some matrix dimensionality reduction methods have been used in cancer prediction, they do not consider the actual situation in clinical medicine. For example, during sample testing, blood samples will have only nonnegative values. However, common dimensionality reduction methods produce negative values in the data matrix of the sample after dimensional reduction, which is a nonphysical result. Meanwhile, because each feature is evaluated independently, such screening methods may fail to capture all the highly discriminative feature subsets, each of which is composed of less discriminative features.
Therefore, at the beginning of the algorithm, we use the NMF method as the matrix decomposition technique to reduce the dimensionality of the sample dataset and then approximate the original matrix using the decomposed matrix and the weight matrix to reduce the time and space complexities of the algorithm. In this paper, NMF is applied to the prediction of early cancer diseases as shown in Fig. 6. The correlation between the different types of matrices is reconstructed by projecting a high-dimensional vector space into a low-dimensional vector space. The algorithm reduces the storage space of the data while maintaining a low time complexity and can effectively improve the prediction performance.
The traditional dimensionality reduction method is used to statistically analyze only the sample attributes and data, without considering other information. NMF is different; it can often represent nonlocal correlations to obtain better prediction results. We can regard the sample matrix as a nonnegative feature matrix, where each row represents the eigenvector of a sample. The goal of NMF is to solve for two nonnegative matrix factors $W \in \mathbb{R}_{+}^{n \times k}$ and $H \in \mathbb{R}_{+}^{k \times m}$, with $(n + m)k < nm$, so that the product of the two approximates the matrix X:
$$X \approx WH,$$
where k represents the dimension of the low-dimensional space and W represents the low-dimensional space vector, called the base matrix. H denotes the coefficients of the vector product of the reconstructed original matrix, which is called a weight matrix. This decomposition problem is usually modeled as a Frobenius norm optimization problem:
$$\min_{W, H} \; \lVert X - WH \rVert_F^2 \quad \text{s.t.} \quad W \ge 0, \; H \ge 0,$$
in which the constraints ensure that all the elements of the matrices W and H are nonnegative. In this paper, we replace the original matrix abs with a coefficient matrix that reduces the dimensions of the original matrix X′ to k. This operation not only reduces the required storage space but also retains the intrinsic information of the data insofar as possible after dimensionality reduction.
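As an illustration of how such a factorization can be computed iteratively, the sketch below minimizes the Frobenius objective with multiplicative update rules. This is an assumption made for illustration: the text refers only to "the iterative method for NMF computing" and does not state which update rule, initialization, or stopping criterion was used. The function name `nmf`, the iteration count, and the random toy matrix are hypothetical.

```python
import numpy as np

def nmf(X, k, n_iter=500, eps=1e-9, seed=0):
    """Factor a nonnegative (n, m) matrix X into W (n, k) and H (k, m).

    Uses multiplicative updates, which keep W and H nonnegative as long as
    X and the random initial factors are nonnegative.
    """
    rng = np.random.default_rng(seed)
    n, m = X.shape
    W = rng.random((n, k)) + eps
    H = rng.random((k, m)) + eps
    for _ in range(n_iter):
        H *= (W.T @ X) / (W.T @ W @ H + eps)  # update the weight matrix
        W *= (X @ H.T) / (W @ H @ H.T + eps)  # update the base matrix
    return W, H

# Toy nonnegative data: 20 samples, 8 attributes, reduced to k = 3.
X = np.random.default_rng(1).random((20, 8))
W, H = nmf(X, k=3)
error = np.linalg.norm(X - W @ H, "fro")
print(W.shape, H.shape, round(float(error), 4))
```

Here W plays the role of the dimensionally reduced sample matrix (one k-dimensional row per sample), and H relates each extracted feature to the original attributes.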
Data division. After the NMF process, we randomly divide the obtained $X^1$ matrix into training data $X^1_{train}$ and test data $X^1_{test}$. The training data $X^1_{train}$ include 90% of the data, and the remaining 10% constitute the test data $X^1_{test}$. It is important to note that we classify all records into two classes based on Y and randomly choose 90% of the records in each class to construct the training set, which eliminates the imbalance effect of the sample data. The training data $X^1_{train}$ are used to train the DBN in the next step, after which the test data $X^1_{test}$ are used as the input to the DBN, which generates $X^2_{test}$, used as input to the final SVM.
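The per-class 90/10 division described above amounts to a stratified split. A minimal sketch follows, assuming that rounding is used when a class size is not divisible by 10 and that a fixed seed is acceptable for reproducibility; the function name `stratified_split` and the toy label counts are hypothetical.

```python
import random

def stratified_split(labels, train_fraction=0.9, seed=0):
    """Return (train_idx, test_idx), sampling train_fraction of each class."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        cut = round(train_fraction * len(idx))
        train_idx.extend(idx[:cut])
        test_idx.extend(idx[cut:])
    return sorted(train_idx), sorted(test_idx)

# Toy labels: 10 positive (early cancer) and 90 negative samples.
labels = [1] * 10 + [0] * 90
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```

Because the split is made inside each class, the test set keeps the same positive-to-negative ratio as the full dataset, so a rare positive class is never left out of the test set by chance.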
The prediction model based on DBN. Given an insufficient number of data samples, some conventional machine learning methods do not achieve good results. For example, traditional neural networks generally have only one or two hidden layers because, with too many neurons and hidden layers, the number of parameters in the model increases rapidly and the model training time becomes increasingly long. Additionally, in traditional neural networks, as the number of layers increases, it becomes difficult to find the optimal solution by using stochastic gradient descent, and the model can easily become trapped in locally optimal solutions. Vanishing gradients and gradient saturation are also prone to occur during backpropagation, resulting in unsatisfactory model results. As the number of layers increases, deep neural networks use many model parameters, which requires large amounts of labeled data during training because it is difficult to find the optimal solution when the training dataset is small. In general, deep neural networks are not a good fit for solving small-sample problems. However, the DBN solves the problem of deep neural network optimization by adopting layer-by-layer training. Under layer-by-layer training, the entire network is given a reasonable initial weight; then, the optimal solution can be reached by simply refining the weights. Restricted Boltzmann machines (RBMs), which play an important role in the training process, are composed of visible layers and hidden layers. The visible layers accept input, and the hidden layers extract features. After training the RBM, the characteristics of the input data can be obtained, i.e., the hidden features of the input data are extracted. Because of the above characteristics of the RBM, DBN layer-by-layer training is effective.
The hidden layer extraction feature makes the training data of subsequent levels more representative, and the problem of insufficient sample size can be solved by generating new data.
DBN performs model training in two main steps: Step 1: separately train each layer of the RBM network in an unsupervised manner and ensure that the maximum feature information is retained when the feature vector is mapped to different feature spaces.
Step 2: Set the back-propagation (BP) network as the last layer of the DBN, take the output feature vector of the RBM as its input feature vector, and train the entity relationship classifier in a supervised manner. Each layer of the RBM network can only ensure that the weights in its own layer are optimal for the feature vector mapping of that layer; the feature vector mapping of the entire DBN is not necessarily optimal. Therefore, the BP network also propagates the error information from top to bottom to each layer of the RBM and finally fine-tunes the DBN network. The RBM network training process can be regarded as the initialization of the weight parameters of a deep BP network, which allows the DBN to overcome the shortcomings of the BP network, where the latter falls readily into local optima and suffers from long training times due to the random initial weight parameters.
In this paper, the number of attribute features obtained by the nonnegative matrix decomposition is K = 14. After training the DBN, the last layer is our output feature. The dimension of the feature vector is the number of nodes in the last layer. The number of nodes is determined through parameter sensitivity experiments according to our data characteristics. We ultimately chose 4 as the number of nodes.
In early cancer prediction, we take the attribute vector V of each case sample after dimensionality reduction as the input of the DBN, as shown in Fig. 7. In this training phase, the visible layer input vector V is passed to the hidden layer. Conversely, the hidden layer activations are sampled to attempt to reconstruct the original input data in the visible layer. Finally, these new visible neuron activation units are passed forward again to reconstruct the hidden layer activation units, yielding $h_1$ and $h_2$. During the training, Gibbs sampling is performed to repeat the above process. The correlation difference between the activated units in the hidden layer and the input visible layer is used as the basis for the update of the weights $W_1$ and $W_2$. The activation probabilities of the hidden and visible units take the standard RBM form:
$$P(h_j = 1 \mid v) = g\Big(\sum_i W_{ij} v_i + a_j\Big), \qquad P(v_i = 1 \mid h) = g\Big(\sum_j W_{ij} h_j + b_i\Big),$$
where g is the sigmoid function, which is defined as follows:
$$g(x) = \frac{1}{1 + e^{-x}}.$$
Here, $b_i$ is the offset of the input layer, and $a_j$ is the offset of the hidden layer. Through this step, we derive the feature matrix $X^2$ as the input to the next classification model.
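The pass from the visible layer to the hidden layer, the reconstruction, and the weight update can be sketched as one step of contrastive divergence (CD-1) for a single RBM. This is an illustrative assumption: the text mentions Gibbs sampling but does not give the number of sampling steps, the learning rate, the batch size, or the layer sizes of the intermediate layers. The sketch uses 14 visible units (the NMF dimension) and 4 hidden units (the size of the last layer) purely as an example, with inputs assumed to lie in [0, 1]; `cd1_step` is a hypothetical name.

```python
import numpy as np

def sigmoid(x):
    """Logistic function g(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(V, W, a, b, lr=0.1, rng=None):
    """One CD-1 update for an RBM.

    V: (batch, n_visible) inputs in [0, 1]; W: (n_visible, n_hidden);
    a: hidden offsets; b: visible offsets.
    """
    rng = rng or np.random.default_rng(0)
    p_h0 = sigmoid(V @ W + a)                           # visible -> hidden
    h0 = (rng.random(p_h0.shape) < p_h0).astype(float)  # sample hidden states
    p_v1 = sigmoid(h0 @ W.T + b)                        # reconstruct visible
    p_h1 = sigmoid(p_v1 @ W + a)                        # hidden from reconstruction
    batch = V.shape[0]
    # Difference between data-driven and reconstruction-driven correlations.
    W = W + lr * (V.T @ p_h0 - p_v1.T @ p_h1) / batch
    a = a + lr * (p_h0 - p_h1).mean(axis=0)
    b = b + lr * (V - p_v1).mean(axis=0)
    return W, a, b

rng = np.random.default_rng(0)
V = rng.random((32, 14))                  # toy batch of reduced samples
W = 0.01 * rng.standard_normal((14, 4))
a = np.zeros(4)
b = np.zeros(14)
W, a, b = cd1_step(V, W, a, b, rng=rng)
features = sigmoid(V @ W + a)             # hidden features for the next stage
print(features.shape)
```

In a DBN, this update would be repeated over many batches for each RBM in turn, with the hidden features of one layer serving as the visible input of the next.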

Prediction model using an SVM.
In the final step of early cancer prediction, we transform the prediction problem into a classification problem. The basic idea of the classification algorithm is to find the dividing hyperplane in the feature space based on the training set X 2 train that best separates the positive and negative samples. We map the original indivisible data to a new space and classify the converted data, as shown in Fig. 8.
We take the output feature matrix $X^2_{train}$ of the DBN in the previous step as the training set of the classification algorithm, i.e., $X^2_{train} = \{(x_i, y_i) \mid i \in \{1, \dots, n\},\ y_i \in \{0, 1\}\}$. Then, the linear SVM learns to obtain the separating hyperplane as follows:
$$w^{T} x + b = 0.$$
The sample points of the two classes that lie closest to the separating hyperplane are called support vectors, and they define two long bands parallel to the separating hyperplane. The distance from the hyperplane indicates the confidence of the classification; the greater the distance, the higher the confidence that the classification is correct. This value is easy to obtain by calculating the following: where $a_i \ge 0$ is the Lagrange multiplier. In the problem of early cancer prediction, we take the result $X^2$ of the DBN output in the previous step as the input of the SVM classification algorithm. Next, we obtain the training model. Finally, we obtain the prediction result (Predict_label) corresponding to the test set (test_data).
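The link between the distance from the hyperplane and the predicted label can be illustrated with a minimal sketch. It assumes that a linear SVM has already been trained and has produced a weight vector `w` and an offset `b`; the values used below are hypothetical and chosen only to make the arithmetic easy to follow, and the helper names are not from the source.

```python
import math

def signed_distance(w, b, x):
    """Signed distance of x from the hyperplane w.x + b = 0."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    norm = math.sqrt(sum(wi * wi for wi in w))
    return score / norm

def predict_label(w, b, x):
    """Label 1 (early cancer) on the positive side of the hyperplane, else 0."""
    return 1 if signed_distance(w, b, x) >= 0 else 0

# Hypothetical trained parameters and two toy feature vectors.
w, b = [3.0, 4.0], -5.0
far_positive = [3.0, 4.0]
origin = [0.0, 0.0]
print(signed_distance(w, b, far_positive), predict_label(w, b, far_positive))
print(signed_distance(w, b, origin), predict_label(w, b, origin))
```

A larger absolute distance corresponds to a more confident prediction, while points near zero lie close to the decision boundary.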