Differential Diagnosis Model of Hypocellular Myelodysplastic Syndrome and Aplastic Anemia Based on the Medical Big Data Platform

The arrival of the era of big data has brought new ideas to solve problems for all walks of life. Medical clinical data is collected and stored in the medical field by utilizing the medical big data platform. Based onmedical information big data, new ideas andmethods for the differential diagnosis of hypo-MDS and AA are studied. The basic information, peripheral blood classification counts, peripheral blood cell morphology, bone marrow cell morphology, and other information were collected from patients diagnosed with hypo-MDS and AA diagnosed in the first diagnosis. First, statistical analysis was performed. Then, the logistic regression model, decision tree model, BP neural network model, and support vector machine (SVM) model of hypo-MDS and AA were established. The sensitivity, specificity, Youden index, positive likelihood ratio (+LR), negative likelihood ratio (−LR), area under curve (AUC), accuracy, Kappa value, positive predictive value (+PV), negative predictive value (−PV) of the four model training set and test set were compared, respectively. Finally, with the support of medical big data, using logistic regression, decision tree, BP neural network, and SVM four classification algorithms, the decision tree algorithm is optimal for the classification of hypoMDS and AA and analyzes the characteristics of the optimal model misjudgment data.


Introduction
Myelodysplastic syndrome (MDS) is a clonal disease of acquired hematopoietic stem/progenitor cells, which is transformed into clinical features by myelocyte hematopoiesis and high risk to acute myeloid leukemia [1].Some patients with MDS have low bone marrow hyperplasia, called hypocellular myelodysplastic syndrome (hypo-MDS).Hypo-MDS is a special type of MDS, accounting for 8.2%-29.0% of the total number of MDS, up to 38.0% [1].Aplastic anemia (AA) refers to the primary bone marrow hematopoietic failure syndrome.The etiology is unknown, mainly manifested as low bone marrow hematopoietic function and complete blood cell reduction.Clinically, there may be bleeding and infection performance [2,3].
At present, the differential diagnosis of hypo-MDS and AA is mainly carried out by hematology, cell morphology, bone marrow biopsy, and cytogenetics.In different stages of disease development, the peripheral blood of patients with hypo-MDS and AA may be reduced in one line, two lines, or three lines simultaneously [4,5].Pathological hematopoiesis is a major indicator of clinical diagnosis of hypo-MDS, but it has the disadvantages of poor reproducibility, poor specificity, and low sensitivity.Furthermore, pathological hematopoiesis can also be seen in some patients with AA [5].Some studies have also found that there is no pathological hematopoietic MDS [6].These show the nonspecificity of pathological hematopoiesis.Previously, cytogenetic abnormalities were considered to be reliable diagnostic criteria for hypo-MDS, but the detection rate of chromosomal abnormalities in MDS patients ranged from 40% to 60% [7,8] and even lower in hypo-MDS patients [9].It can be seen that the abnormal cytogenetic ratio of MDS is not very high, suggesting that the index is not specific.In recent years, the value of flow cytometry (FCM) in the differential diagnosis of AA and hypo-MDS has become increasingly important [10,11], but the differential diagnosis of hypo-MDS and AA with a single immunophenotypic marker is too low.The use of FCM to assess erythroid malignancies (a milestone in the diagnosis of MDS morphology) is difficult, limiting the widespread use of FCM in the diagnosis of MDS [12].
It can be seen that the pathological features and clinical manifestations of hypo-MDS and AA are very similar, and there are many differential diagnosis indicators, but the specificity is not high and the differential diagnosis of these two diseases is difficult in clinical practice.Every diagnosis process of disease will produce a large amount of data, and the data contain a lot of information about the disease.Therefore, using the collected big data for data mining, we can effectively analyze the disease.
Data mining refers to the process of extracting knowledge and information that has potential application value from large databases.It is a new type of the information processing system that has been rapidly developed in recent years [13].Classification is a very important task in data mining.Commonly used methods include logistic regression, neural networks, decision trees, and SVM.Each of these methods has its own characteristics, has a strong representation in the classification algorithm, and has been widely and successfully applied in the medical field [14][15][16].
Some scholars have compared the classification effects of data mining classification methods in the medical field.For example, Agarwal et al. [17] compared the Bayesian, SVM, and decision tree classification results using medical data.The results show that the SVM has the highest classification accuracy.Heydari et al. [18] compared neural network, SVM, decision tree, and Bayesian methods in the diagnosis of type 2 diabetes and found that the highest accuracy of the neural network model is 97.44%, the decision tree is 95.03%, and the Bayesian network is 91.60%, while the accuracy of SVM is only 81.19%.Lui et al. [19] used SVM, Bayesian networks, radial basis neural networks, and multilayer perceptrons to establish a classification model of magnetic resonance features of mild traumatic brain injury.The highest accuracy rate is the radial basis neural networks (74%); the worst is the multilayer sensor (66%), and SVM and Bayesian network are 70%.Tseng et al. [20] used decision trees and neural network methods to analyze the prognosis of oral cancer patients and found that both methods had higher accuracy, but compared to neural networks, the results of the decision tree model are easier to explain and easier to accept.Wu et al. [21] compared the classification performance of the BP neural network and logistic regression and found that the classification accuracy of the BP neural network (93.5%) was higher than that of the logistic regression model (90.7%).
Based on the current clinical problems of differential diagnosis of hypo-MDS and AA, the case data of hypo-MDS patients and AA patients were analyzed, the data that did not conform to the actual and the errors were deleted, and the pure data was obtained.Then, applying pure big data to the data mining algorithm was done to compare the effects.In this study, logistic regression, decision tree, BP neural network, and SVM are used to establish the differential diagnosis model of two diseases.Through the evaluation of the model, a better classification model is finally obtained, combined with the clinical features of the misdiagnosed cases of the best differential diagnosis model, and the combined differential diagnosis is performed.This provides an effective new idea and method for the differential diagnosis of hypo-MDS and AA.

Medical Big Data Acquisition and Storage
System Based on the Medical Big Data Platform  1).Medical big data platform is an open source, distributed storage, distributed computing platform that extends a single server to a cluster machine, with each node providing local computing and storage without relying on hardware for high availability.As the core component, MapReduce is used to implement task decomposition and scheduling.MBDP is used to store massive amounts of data.By storing the medical big data of clinical patients in real time and further effectively calculating and processing, the application value of the medical big data platform is fully utilized.2 Complexity balancing ability is poor, NameNode single point failure, Job-Tracer load is too large, small file problem, hot spot problem, etc.Both seriously restrict the further development of MBDP.

Research on
In order to achieve higher storage efficiency and more optimized load balancing capabilities for MBDP, an improved solution for MBDP is Noah.The management of the section is done by the mapping file to each node of the cluster, which solves the performance bottleneck problem of the central node (see Figure 2).The experimental results show that Noah improves the data recovery speed of MBDP while ensuring the security of cluster data, optimizes the load balancing capability of MBDP, and reduces the overall storage cost of the medical big data platform.This has obvious implications for improving the actual operational efficiency of the medical big data platform and its associated cloud computing architecture.

Storage Platform Framework
Based on Hypo-MDS and AA Case Big Data.According to the timeliness and large reserves of hypo-MDS and AA case data, the medical big data platform distributed storage system is placed in the virtualization pool of the resource management platform, with the medical big data platform slave node deployed dynamically, and the medical big data platform distributed storage is quickly built.
The newly built big data storage platform has good compatibility and long life cycle.The medical diagnosis process data is stored in the platform in real time to realize data analysis and processing.In the process of data storage, patient's basic information, peripheral blood classification count, peripheral blood cell morphology, bone marrow cell morphology, and other quantifiable data are included.Further interface with the classification system to achieve differential diagnosis of hypoplastic myelodysplastic syndrome and aplastic anemia is needed.Disease Hospital.A medical information database was made to collect basic information and medical history of eligible patients including the patient's gender, age, occupation, marital status, and smoking and drinking history.And the clinical examination data of the patients, including the peripheral blood classification count, peripheral blood cell morphology, and bone marrow cell morphology, were also collected.(2) Have a history of malignant tumors 3.2.Hypo-MDS and AA Diagnostic Criteria [2].Hypo-MDS and AA disease diagnostic criteria overlap in hematology and cell morphology such as peripheral blood cell reduction and bone marrow hyperplasia [22].How to distinguish between the two is often a big problem that plagues clinicians.The use of data mining methods to apply the collected data to the differential diagnosis of low proliferative myelodysplastic syndrome and aplastic anemia will greatly improve the accuracy of diagnosis.

Hypo-DMS Diagnostic Criteria.
Hypo-MDS has so far no unified diagnostic criteria.The reference conditions for hypo-MDS diagnosis are as follows.( 1) Peripheral blood showed more than two series of cytopenias, and the original cells or nucleated red blood cells could be seen in the classification.(2) Bone marrow smears show hyperplasia at more than two sites.(3) Bone marrow sections show a decrease in the bone marrow hematopoietic area and bone marrow cell volume, less than 30% for those under 60 years old and less than 20% for those over 60 years old.(4) The bone marrow has a pathological hematopoiesis in one or both blood cells, and the number of primitive cells varies depending on the MDS subtype.

AA Diagnostic Criteria.
The AA diagnostic criteria include the following data: (1) the reduction of whole blood cells, the percentage of reticulocytes < 1%, and the increase of the proportion of lymphocytes; (2) generally without hepatosplenomegaly; (3) reduced hyperplasia of bone marrow (<normal 50%) or severe reduction (<normal 25%), decreased hematopoietic cells, increased proportion of nonhematopoietic cells, and empty bone marrow granules (bone marrow biopsy shows that hematopoietic tissue is reduced); (4) can exclude other diseases that cause pancytopenia, such as PNH, acute hematopoietic function arrest, megaloblastic anemia, myelofibrosis, and acute leukemia.China University of Technology were selected as the study subjects.A total of 325 cases of AA patients were collected, among which 118 were not diagnosed for the first time and 51 cases were incomplete.A total of 156 AA patients entered the study.We collected 162 cases of patients with hypo-MDS, of which 19 were not first diagnosed and 13 cases were incomplete.In total, 130 patients with hypo-MDS entered the study (see Figure 3).
In terms of occupational composition of patients, the patient populations of the two diseases are mainly concentrated in workers, farmers, and students.The difference in occupational composition between the two diseases was statistically significant (χ 2 = 49 87, P < 0 001).The proportion of farmers with hypo-MDS is the highest (38.46%), while the percentage of students with AA is the highest (51.92%) (see Table 1).

Results of Laboratory Tests in Two Groups of Patients.
Peripheral blood cell counts, blood smears, and bone marrow smears were analyzed in 130 patients with hypo-MDS and 156 patients with AA.Blood cell counts showed that the red blood cell content and hemoglobin content in hypo-MDS patients were lower than those in AA patients, and the difference was statistically significant (P < 0 05).The platelet content of patients with hypo-MDS was lower than that of patients with AA, but there was no significant difference between the two groups (P > 0 05).
Blood smear showed that the proportion of neutrophils in rod-shaped nucleus was lower in hypo-MDS patients than in AA patients, and the proportion of mature lymphocytes was lower in AA patients than in AA patients (P < 0 05).The proportion of neutrophilic neutrophils and mature mononuclear cells in patients with hypo-MDS was higher   4 Complexity than that in patients with AA, but there was no significant difference between the two groups (P > 0 05).
The morphology of myeloid cells showed that the proportion of precocious neutrophils, late neutrophils, neutrophils, polymorphonuclear neutrophils, and mature lymphocytes was lower in patients with hypo-MDS than in patients with AA.The proportion of early red blood cells, medium and young red blood cells, late young red blood cells, and mature plasma cells is higher in hypo-MDS patients than in AA patients.And the difference was statistically significant (P < 0 05).The proportion of mature monocytes in patients with hypo-MDS is higher than that of patients with AA.The proportion of neutrophils and rod-shaped nuclear neutrophils is lower than that of AA patients, but the difference was not statistically significant (P > 0 05) (see Table 2).AA is statistically significant, there is no evidence that the prevalence of hypo-MDS and AA is related to occupational factors, so occupational factors are not included in the establishment of the model.Red blood cells and hemoglobin in blood cell counts were included in the establishment of the model as a basic reference for the differential diagnosis of clinical hypo-MDS and AA.There is a literature supporting [23] that neutrophils, precocious erythroblasts, medium and young erythrocytes, late erythroblasts, mature lymphocytes, and mature plasma cells contribute to the identification of hypo-MDS and AA, so these indicators were also included in the model (see Table 3 for variable assignments).

Decision Tree-Based Differential Diagnosis Model
4.1.The Establishment of a Decision Tree Model.The decision tree [24] is a layered rule of a tree structure formed by a topdown transfer method by determining a series of logical branch relationships.The root node, intermediate nodes, and leaf nodes are generated in the decision tree generation process.The root node, intermediate nodes, and leaf nodes are generated in the decision tree generation process.The root node of the decision tree is the beginning of the decision tree.It represents the most distinguishing feature variable of the sample data.Then, the feature classification point of the node was selected to split the node until the data of a certain node only belongs to one category or the variance is the smallest, and the node will not split.
The key issue of decision tree generation is the selection of the most partitioned attributes, namely, the selection of node features and feature splitting points.As the decision tree continues to grow downwards to generate various branch nodes, we hope that the samples contained in each node belong to the same category as much as possible, that is, the impurity of the growing nodes of the tree is getting lower and lower.According to different decision tree algorithms, there are three methods used to measure the degree of node impurity [25,26]: information gain, gain ratio, and Gini index.
The C5.0 algorithm in the decision tree model often uses information gain to select node features and feature split points.The calculation method is as follows.Information entropy is an indicator used to describe the purity of sample data.Assume that the relative frequency of c samples in sample data set I is p c (c = 1, 2, … , γ ), then, the information entropy of I is The smaller the Ent I , the higher the purity of I.When sample data is evenly distributed in each category, the maximum entropy log 2 C is used to indicate the lowest purity.When all samples belong to the same category, the information entropy has a minimum value of 0, indicating the highest purity.
Assume that a is the attribute of the sample data set I, there are V possible values a 1 , a 2 , … , a V ; then, we can use the attribute a to make a V branch nodes after zapping the sample data set I. We note that in sample data set I contained in the v branch node, all samples on a that have an a v value are I v .Therefore, the information gain obtained by dividing attribute data set I with attribute a is In general, the greater the information gain, the greater the purity of the division of the sample data set by the attribute a.Therefore, the information gain can be used to select the division attribute of the decision tree.

Complexity
The common gain rate of the C4.5 algorithm in the decision tree model is used to select node features and feature splitting points.Using the same sign as the information gain calculation, the gain rate is defined as where This is called the intrinsic value of the attribute a.The more possible the value V of the attribute a, the larger the value of the DV a will generally be.
The CART algorithm in the decision tree model uses the Gini index to select node features and feature splitting points.Using the same sign as the information gain calculation, the Gini index of sample data set I can be expressed as From the sample data set I, randomly selected two samples, according to the above formula, can be obtained, Gini D reflects the probability of inconsistency between the two random sample categories.Thus, the smaller the Gini D , the higher the purity of the sample data set I.
The Gini index for attributes is defined as Therefore, we choose the attribute with the smallest Gini index as the optimal partition attribute in the candidate attribute set A, namely, 4.2.Pruning of Decision Trees.In the top-down generation process of decision trees, overfitting often occurs if there is no restriction on its growth.At this point, the decision tree needs to be pruned to correct overfitting.The pruning of decision trees cannot be arbitrarily done, and it often needs to take into account the prediction accuracy and complexity of the decision tree; otherwise, it will cause decision loss.
Pruning is divided into prepruning and postpruning according to the time of pruning [27].Prepruning occurs during the growth of the decision tree and is estimated before the node is divided.If the division at this time does not improve the performance of the decision tree, then the partitioning is stopped and the decision branch of the decision tree is reduced.After the pruning occurs after the completion of the growth of the decision tree, the nonleaf node is evaluated.
If the subtree under the node can replace the leaf node to improve the performance of the decision tree, it is pruned to prevent overfitting.

Establishment of Hypo-MDS and AA Decision Tree
Models.The sample big data is partitioned; the training partition is 73% in the model establishment process, and the test partition accounts for 27%.The C5.0 algorithm is used to select the boosting method and cross-validation.The pruning severity is set to 75, and the minimum number of records per subbranch is 2. The global pruning is chosen to establish a decision tree model for the two diseases.The model of the training set was 209 cases: 199 cases were correctly classified and 10 cases were misclassified.The test set samples were 77 cases: 62 cases were correctly classified and 15 cases were misclassified (see Table 4).Sensitivity, specificity, Youden index, positive likelihood ratio, negative likelihood ratio, AUC, accuracy, Kappa value, positive predictive value, and negative predictive value of the model classification were evaluated (see Table 5).The dendrogram depth is 8, and there are 9 layered nodes.The proportion of late erythroblasts in bone marrow cells is used as the root node to develop the growth of the decision tree.After the growth of the decision tree is completed, we can extract valid information according to the decision rules of the decision tree, in order to achieve the purpose of identifying hypo-MDS and AA (see Figure 4).For example, the decision message passed to us by node 4 is that the percentage of late erythroblasts in bone marrow cells is less than 26.50% and that of peripheral blood red blood cells is greater than 1.36%.When the age is less than 39 years old, the likelihood of the patient being AA is 76.92% and the probability of the patient being hypo-MDS is 23.08%.The analysis of the effect of independent variables on the model showed that peripheral blood red blood cells had the greatest influence on model classification, followed by medium and young red and late young red blood cells in bone marrow cells (see Figure 5).Combining the above results, logistic regression, decision tree, BP neural network, and SVM are used to evaluate the classification models of hypo-MDS and AA big data from three aspects: authenticity, reliability, and benefit.The results show that, in terms of the comparison of authenticity evaluation, logistic regression, decision tree, BP neural network, and SVM, the decision tree model has the best authenticity.In terms of reliability evaluation, the reliability of the decision tree model is best compared with logistic regression, decision trees, BP neural networks, and SVM.In terms of model benefits, logistic regression, decision tree, BP neural network, and support vector machine have the highest benefit compared to the decision tree model (see Table 6).
After comparison, the sensitivity difference between logistic regression model and decision tree model and between decision tree model and support vector machine has statistical significance (P < 0 05) (Table 7).There is no statistically significant difference among other models (P > 0 05).The difference in specificity between logistic regression model and decision tree model, decision tree model and BP neural network, and decision tree model and support vector machine has statistical significance (P < 0 05).There is no statistically significant difference among other models (P > 0 05).The difference in accuracy between logistic regression model and decision tree model, decision tree model and BP neural network, and decision tree model and support vector machine has statistical significance (P < 0 05).There is no statistically significant difference among other models (P > 0 05).There was a statistically significant difference in the ROC curve area between logistic regression model and decision tree model, between decision tree model and BP neural network, and between decision tree model and support vector machine (P < 0 05).There is no statistically significant difference among other models (P > 0 05).Through the distribution map of AUC, it can be found that the area under the curve of the decision tree is the largest, indicating that the effect is the best, as shown in Figure 6.
Combining the above model evaluation indicators, the decision tree model is the optimal model for classifying big data of hypo-MDS and AA in terms of model authenticity, reliability, and benefit evaluation.

Results
for Test Set Samples.Combined with the above results, the logistic regression, decision tree, BP neural network, and support vector machine hyper-MDS and AA big data classification model are evaluated from three aspects: authenticity, reliability, and benefit.The results show that, in terms of the comparison of authenticity evaluation, logistic regression, decision tree, BP neural network, and SVM, the decision tree model has the best authenticity.In terms of 9 Complexity reliability evaluation, the reliability of the decision tree model is best compared with logistic regression, decision trees, BP neural networks, and SVM.In terms of model benefits, logistic regression, decision trees, BP neural networks, and SVM compare the decision tree models with the highest returns.After comparison, the sensitivity, specificity, accuracy, and area under the ROC curve of the four models were not statistically significant (P > 0 05) (see Figure 7).Although the results of the two comparisons show that the differences between the models are not statistically significant, the performance of the decision tree model is significantly better than the other three models in terms of various indicators of model evaluation.In summary, the decision tree model is the optimal model for classifying hypo-MDS and AA big data, both in terms of model authenticity, reliability, and benefit evaluation.Through the distribution map of AUC, it can be found that the area under the curve of the decision tree is the largest, indicating that the effect is the best, as shown in Figure 7.

Analysis of Cases of Hypojudgment of Hypo-MDS and
AA. Through the model evaluation, we find that the decision tree model is the optimal classification model.Although the decision tree model has a good prediction effect, this model still has the potential to misjudge hypo-MDS and AA.Therefore, it is more conducive to the differential diagnosis of these two diseases of the in-depth analysis of misdiagnosed cases.

Hypo-MDS Misjudgment Case
Analysis.The optimal model decision tree model classified 130 patients with hypo-MDS and classified 13 patients with hypo-MDS as AA patients.Comparing the misjudgment cases with the positive cases, it was found that the red blood cell content and hemoglobin content in the misjudged cases in the peripheral blood cell count were higher than the positive cases.The proportion of mature lymphocytes in misdiagnosed cases in bone marrow smear is higher than that in positive cases.The proportion of early erythroblasts and late erythroblasts was lower than that of positive culprit cases, and the difference was statistically significant (P < 0 05).There was no significant difference among other indicators (P > 0 05).

AA Misjudgment Case
Analysis.The optimal model decision tree model classified 156 patients with AA, and 15 patients with AA were misclassified as hypo-MDS patients.Comparing the erroneously judged case with the positive case, it was found that the erythrocyte content and hemoglobin content in the erroneously judged cases in the peripheral blood cell count were lower than the positive case.The proportion of early erythroblasts, the ratio of red blood cells to young erythroblasts, and the proportion of late erythroblasts in misdiagnosed cases in bone marrow smears are higher than that in positive cases.The proportion of mature lymphocytes was lower than that of positive cases, and the difference was statistically significant (P < 0 05).However, there was no significant difference in other indicators (P > 0 05).

Conclusion
According to the analysis of basic patient data and disease index data, the difference in age and occupational composition between patients with hypo-MDS and AA was statistically significant (P < 0 05).There was no significant difference in other basic data (P > 0 05).For training set, logistic regression, BP neural network, support vector machine and decision tree sensitivity, Youden index, positive likelihood ratio, classification accuracy, positive predictive value, and negative predictive value were evaluated.There was a statistically significant difference in sensitivity between logistic regression model and decision tree model and between decision tree model and support vector machine (P < 0 05).The specificity, accuracy, and area under ROC curve between decision tree model and logistic regression model, decision tree model and BP neural network, and    10 Complexity decision tree model and support vector machine were statistically significant (P < 0 05).For the test set, logistic regression, BP neural network, support vector machine and decision tree sensitivity, Youden index, positive likelihood ratio, classification accuracy, positive predictive value, negative predictive value, the sensitivity, specificity, accuracy, and area under the ROC curve of the four models were not statistically significant (P > 0 05).
The classification effects of logistic regression, decision tree, BP neural network, and support vector machine are compared.The decision tree algorithm has the best classification effect on hypo-MDS and AA, which can help the clinicians to identify and diagnose the two diseases.

Figure 1 :
Figure 1: The core of the medical big data platform.

3. 1 .
Storage Database Construction of Hypo-MDS and AA Cases Big Data 3.1.1.Data Collection for Hypo-MDS and AA Cases.Case data of hypo-MDS patients and AA patients were taken from the Affiliated Hospital of North China University of Technology and the Chinese Academy of Medical Sciences Blood
Data in Hypo-MDS and AA Cases 3.3.1.General Situation of the Research Object.From January 1, 2008, to December 31, 2016, patients with hypo-MDS and AA diagnosed at the Institute of Hematology, Chinese Academy of Medical Sciences, and the Affiliated Hospital of North 325 AA confirmed 118 not the first confirmed patients 51 incomplete information 156 AA patients entered the study 162 hypo-MDS confirmed 19 not the first confirmed 13 incomplete patient 130 hypo-MDS patients entered the study

Figure 3 :
Figure 3: The selection of research object.

Figure 4 :
Figure 4: Decision tree model of hypo-MDS and AA.

Figure 5 : 8 Complexity 5 . 2 .
Figure 5: Analysis of the importance of independent variables to the model.

Figure 6 :
Figure 6: The ROC curve of four prediction models.

Figure 7 :
Figure 7: The ROC curve of four prediction models.
Distributed Optimization of MBDP Based on Big Data.MBDP has been excellent enough in stability and performance, but it has low storage efficiency, cluster load

Table 1 :
The basic characteristics of the patients.

Table 4 :
Classification accuracy of training set and test set (n (%)).

Table 5 :
Decision tree model evaluation.

Table 6 :
The comparison of the training set evaluation index.