Data mining process for predicting diabetes mellitus based model about other chronic diseases: a case study of the northwestern part of Nigeria

This study applies data mining (DM)-based approaches to predict diabetes mellitus using a dataset collected from the seven northwestern states of Nigeria. Data were collected from both primary and secondary sources through questionnaires and verbal interviews with patients with diabetes mellitus and other chronic diseases. Some hospital data were also taken from the records of the patients involved in this work. The dataset comprises 281 instances with 8 attributes. R programming software (version 5.3.1) was used in the experiments. The DM techniques used in this research were binomial logistic regression, classification, confusion matrix and correlation coefficient. The data were partitioned into training and testing sets. Training data were used to build the model, while testing data were used to validate it. The algorithm for the best-fitted model converged with null deviance: 281.951, residual deviance: 16.476 and AIC: 30.476. The significant variables are AGE, GLU, DBP and KDYP, with P values of 0.025, 0.01, 0.05 and 0.025, respectively. The predicted model achieved an accuracy of ∼97.1%. The correlation analysis results revealed that diabetic patients are more likely to be hypertensive than patients with other chronic diseases considered in the research.


Introduction:
Recent developments in biotechnology and health sciences have led to the production of large volumes of data, such as clinical information generated from large Electronic Health Records. This information can be used to forecast and scrutinise health care ratios of the entire population. Data Mining (DM) techniques make it possible to extract hidden and useful information from such datasets, a process known as Knowledge Discovery in Databases and computer-based information systems [1].
Logistic regression is a statistical technique used to predict the probability of an event given a set of predictor variables. The procedure is more sophisticated than linear regression procedures. Binary logistic regression allows a predictive model to be built for a binary dependent variable: it explains the relationship between that binary dependent variable and a set of independent variables, which can be continuous or discrete. As a non-linear regression model, logistic regression is a special case of the generalised linear model (GLM) [2], used where the assumptions of normality and constant variance of the residuals are not satisfied. Logistic regression models have demonstrated their precision in many classification frameworks [3].
The significance of DM in the health sector raises further challenges, which call for explicit processes and tools. Cross-domain knowledge is of paramount importance for achieving practical results. The rapid automation of the healthcare industry has made a huge amount of heterogeneous data, both structured and unstructured, accessible for research and secondary use. Several algorithms have been implemented to classify, cluster, and find hidden patterns in data, yet domain-specific issues of health care are yet to be resolved. As discussed by Abidi and Hoe [4], particular problems must be resolved for DM systems to be applied effectively. According to their study, unless depersonalisation, multi-relational and media data pre-processing, clinical data heterogeneity, and quality issues are resolved, the DM process is suboptimal or infeasible.
Diabetes is a persistent health problem and a pandemic. In developing countries, traditional tribal societies are adopting a contemporary lifestyle while developing chronic health problems usually associated with developed nations [5]. The direct and indirect problems caused by the disease exceed the financial and human resources of the health care system in sub-Saharan Africa (SSA) [6]. Presently, hypertension, diabetes, and coronary artery diseases are among the foremost chronic health conditions observed in SSA [7].
The estimated prevalence of diabetes in Africa is 1% in rural areas and 7% in urban SSA [8], while the prevalence in Nigeria varies from 0.65% in rural areas to 11% in urban areas. Data from the World Health Organization (WHO) indicate that Nigeria has the greatest number of people living with diabetes in Africa [9]. Nigeria, the most populous country in Africa, has approximately 196 million people in an area of about one million km². Nigeria is also the seventh most populous country in the world [10]. According to the United Nations, Nigeria's population will reach 411 million by 2050, when Nigeria may be the third most populous country in the world. By 2100, the population of Nigeria may reach 794 million [11]. The northwestern region is the second largest geopolitical area, covering 216,065 km², and is the most densely populated area, with an estimated population of 45 million people [12].
Recently, research on diabetes in Nigeria was conducted to investigate and evaluate the incidence of diabetes among different social and economic groups in Port Harcourt, so that a model for Nigerians might emerge and make it possible to ascertain whether or not those with high blood glucose are aware of their diabetic problem [13]. In 2008, a benchmark diabetes study [14] was conducted across selected health centres in Nigeria, with the objectives of assessing the clinical and laboratory profile and evaluating the quality of care of Nigerian diabetics, with a view to planning and improving diabetes care. Another related study was carried out in northwestern Nigeria to assess diabetic patients' compliance with management, including the socio-demographic factors influencing their compliance [15]. However, in spite of the growing prevalence of diabetes mellitus and other chronic diseases, particularly in northwestern Nigeria, to the best of my knowledge, there has been a paucity of research and awareness in the area. Given this, the present research was carried out to predict diabetes mellitus in relation to other chronic diseases using DM-based approaches, with the northwestern part of Nigeria as the case study.

This is an open access article published by the IET under the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0/)

Material and methods:
This section explains the analytical platform and the methods used in this Letter. It covers the sampling techniques, the dataset description and collection, the issue of ethical approval, and the software programming language used for the experiments. The DM techniques used include binomial logistic regression, classification and confusion matrix, and correlation coefficient.
2.1. Proposed analytical platforms for the model: The proposed analytical platform of the predicted model is presented in Fig. 1.

Sampling techniques used in the process of data collection:
The questionnaires were distributed, and the sample size determined, using a combination of probability and non-probability sampling techniques. Probability cluster sampling was applied at the beginning: the entire population was divided into groups or clusters, and a subset of individuals was selected from the population to estimate the characteristics of the entire population [27]. Each attribute determines one or more properties of the observable subjects, which are treated as independent individuals. A non-probability sampling technique was then used by the researcher to select samples based on subjective judgment rather than random selection. Convenience sampling, in which samples are drawn from the population because they are conveniently available to the researcher, was also employed.
The northwestern part of Nigeria comprises seven (7) states. Each state was divided into three (3) clusters according to its senatorial zones (i.e. south, central and north). Government-owned hospitals were chosen in each cluster of six (6) of the states, while in Kano the number of hospitals was doubled because of its population. The target population is patients with diabetes and other chronic diseases. To achieve greater precision in the data collection, the author distributed all the questionnaires himself across the states and interviewed the patients with the help of some hospital staff.
2.2.3 Data collection: Data were collected from both primary and secondary sources through questionnaires and verbal interviews with patients who have diabetes mellitus and other chronic diseases. Some hospital data were also taken from the records departments of all hospitals in the study. The dataset for this particular study comprises 281 instances with 8 attributes. The attributes are abbreviated as follows: patient's diabetes mellitus status (TYPE), patient's age (AGE), patient's glucose level (GLU), patient's diastolic blood pressure (DBP), patient's body mass index (BMI), symptoms related to kidney problems (KDYP), symptoms related to heart/cardiovascular problems (HETP) and symptoms related to eye problems (EYEP).
The details of the attributes can be seen in Fig. 2.
2.3. Variable selections: Variable selection is the process of selecting the leading variables from a dataset and removing features unrelated to the task to be performed. The purpose is to identify a set of P process variables, with P < J, that is better able to explain and predict the response variable y [16]. Stepwise variable selection combines backward elimination and forward selection. It applies both processes, based on the significance of the score statistics and the probability of the likelihood-ratio statistics on the conditional parameter estimates. Variables can be removed, added or changed at each stage of the process. The Akaike information criterion (AIC) was used to check model adequacy.
Here P is the number of potential predictors. The design matrix of independent variables, X, is composed of N rows and K + 1 columns, where K is the number of independent variables specified in the model. For each row of the design matrix, the first element is $x_{i0} = 1$; this is the intercept. The parameter vector, β, is a column vector of length K + 1: there is one parameter corresponding to each of the K columns of independent variable settings in X, plus one, $\beta_0$, for the intercept [17]. The transformed logit, which is the logistic regression model, equates the log-odds of the probability of success to the linear component as:

$$\log\left(\frac{u_i}{1 - u_i}\right) = \sum_{k=0}^{K} x_{ik}\,\beta_k, \qquad i = 1, \ldots, N$$

where $u_i/(1 - u_i)$ is known as the odds of an event. Suppose y takes the value 1 for an event and 0 for a non-event; then y has a Bernoulli distribution with probability parameter (and expected value) p.
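The relationship between the linear component and the event probability can be illustrated with a minimal Python sketch (the experiments in this Letter were run in R; Python is used here only for illustration). The coefficient values and the helper names `linear_predictor`, `inverse_logit` and `logit` are hypothetical and are not the fitted values of this study.

```python
import math

def linear_predictor(x_row, beta):
    """Linear component: sum over k of x_ik * beta_k for one design-matrix row."""
    return sum(x * b for x, b in zip(x_row, beta))

def inverse_logit(eta):
    """Map a log-odds value back to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-eta))

def logit(p):
    """Log-odds of a probability p, i.e. log(p / (1 - p))."""
    return math.log(p / (1.0 - p))

# Hypothetical coefficients (intercept first); NOT estimates from this study.
beta = [-1.0, 0.5, 0.25]
# One design-matrix row; the first element is the intercept term x_i0 = 1.
x_row = [1.0, 2.0, 0.0]

eta = linear_predictor(x_row, beta)  # log-odds of the event
p = inverse_logit(eta)               # probability of the event
print(eta, p)
```

Applying `logit` to the resulting probability recovers the linear component, which is the sense in which the model equates the log-odds to the linear predictor.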

Correlation coefficient:
The correlation coefficient is a statistical technique used to indicate the degree of the relationship between variables, including its strength and direction. The coefficient ranges from −1 to +1; a value of 0 indicates no linear relationship, and the closer the absolute value is to 1, the stronger the relationship [18]. The Pearson correlation coefficient equation is presented as follows:

$$r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2 \, \sum_{i=1}^{n}(y_i - \bar{y})^2}}$$

where $x_i$ and $y_i$ are the paired observations of the two variables, and $\bar{x}$ and $\bar{y}$ are their means.
2.6. Classification accuracy: As fundamental techniques of the DM process [19], classification techniques can be used to characterise types of customers, objects and items by using multiple attributes to identify the defined class. The main goal of classification is to assign a class to previously unseen records as accurately as possible. Given a collection of records (called a training set) in which each record contains a set of attributes, one of the attributes is the class [20,21]. The aim is to find a classification model for the class attribute, and a testing set is used to determine the accuracy of the model. The known data set is divided into training and testing sets: the training set is used to build the model and the testing set is used to validate it [22,23]. The classification process consists of a training set that is analysed by a classification algorithm, producing the classifier or learner [24]. The model is therefore represented in the form of classification rules, decision trees or mathematical formulas [25]. Testing data are then applied to the classification rules to estimate accuracy.

Table 3 presents a confusion matrix for the actual and predicted values of the logistic regression results. Table 4 presents a correlation matrix describing the level of relationship between the variables TYPE, KDYP, HETP, EYEP and HBPK.
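As an illustration of how each entry of such a correlation matrix is obtained, the following is a minimal Python sketch of the Pearson coefficient for one pair of variables. The two indicator vectors are hypothetical and are not taken from the study dataset; the function name `pearson_r` is likewise an assumption made for this example.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products and sums of squares of the deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)

# Hypothetical 0/1 indicators for two conditions across six patients.
condition_a = [1, 0, 1, 1, 0, 0]
condition_b = [1, 0, 0, 1, 0, 1]

r = pearson_r(condition_a, condition_b)
print(round(r, 3))
```

Repeating the calculation for every pair of variables yields a symmetric matrix with ones on the diagonal.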
The ROC curve for the TP and FP rates is shown in Fig. 3. The graphical representation of the correlations between the variables TYPE, KDYP, HETP, EYEP and HBPK, taken from the correlation matrix given in Table 4, is shown in Fig. 4. A darker ellipse indicates a strong positive correlation, while white shows no correlation.
3.2. Discussions 3.2.1 Logistic regression: Tables 1 and 2 present the logistic regression results obtained in R from the healthcare dataset with eight (8) attributes: TYPE as the dependent variable and AGE, GLU, DBP, BMI, KDYP, HETP and EYEP as the independent variables. The data were partitioned into training and testing sets; the training part was used to build the model, while the testing part was used to validate it.
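The partitioning step can be sketched in Python as follows. The split ratio is not stated in this Letter; the 25% test fraction used here is an assumption, chosen because the confusion matrix counts reported in the discussion sum to 70 of the 281 instances. The function name `train_test_split`, the seed and the placeholder row identifiers are all hypothetical.

```python
import random

def train_test_split(rows, test_fraction=0.25, seed=0):
    """Randomly partition rows into (train, test) lists without overlap."""
    rng = random.Random(seed)       # fixed seed so the split is reproducible
    shuffled = list(rows)           # copy so the caller's list is not modified
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Placeholder row identifiers standing in for the 281 patient records.
rows = list(range(281))
train, test = train_test_split(rows)
print(len(train), len(test))
```

Every record falls into exactly one of the two sets, so the model is validated only on records it did not see during fitting.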
In Table 1, the algorithm for the selection of factors with significant effects converged after 12 Fisher scoring iterations. Variables AGE, GLU and KDYP were significant at P values of 0.025, 0.025 and 0.05, respectively, with null deviance: 281.951, residual deviance (RD): 16.302 and AIC: 32.302.
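The reported deviance and AIC values can be cross-checked with a short Python sketch. It assumes the standard relationship for a logistic model fitted to ungrouped binary responses, where the residual deviance equals −2 times the log-likelihood, so that AIC = RD + 2k for k estimated parameters. The parameter counts (8 for the model of Table 1, 7 for the reduced model of Table 2) are inferred from the reported figures and are not stated explicitly in the text.

```python
def aic_from_deviance(residual_deviance, n_params):
    """AIC for a binary-response logistic model: deviance plus 2 per parameter."""
    return residual_deviance + 2 * n_params

# Table 1: intercept plus seven predictors (parameter count inferred, see above).
aic_full = aic_from_deviance(16.302, 8)
# Table 2: reduced model; seven parameters inferred from the reported AIC.
aic_reduced = aic_from_deviance(16.476, 7)

print(round(aic_full, 3), round(aic_reduced, 3))
```

The reduced model has a slightly larger residual deviance but a smaller AIC, which is why the stepwise procedure prefers it.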
In Table 2, the less significant variables were removed from the model. After 11 Fisher scoring iterations, the algorithm for the selection of factors with significant effects converged with null deviance: 281.951, RD: 16.476 and AIC: 30.476. Variables AGE, GLU, DBP and KDYP were significant at P values of 0.025, 0.01, 0.05 and 0.025, respectively. BMI was the exception: it was not significant at these levels, but it was not removed from the model.

Table 3 presents a confusion matrix for the adequacy of the predicted model. The model predicted as follows. Seventeen (17) times the patient was actually diabetic and the model also predicted diabetic. Fifty-one (51) times the patient was non-diabetic and the model also predicted non-diabetic. Two (2) times the patient was actually non-diabetic but the model predicted diabetic (Type I error). Zero (0) times the patient was actually diabetic but the model predicted non-diabetic (Type II error). This gives a model accuracy of ∼97.1%. Fig. 3 presents the ROCR plot of the TP rate against the FP rate for the prediction and performance of the model. The plot was also used to choose the threshold for the confusion matrix prediction, in case the default needs to be changed.

Table 4 presents a correlation matrix for diabetes mellitus and some other chronic diseases. The correlation coefficient was used to check the relationship between them. The symptoms related to kidney problems have the highest correlation with symptoms related to heart/cardiovascular problems (0.32), followed by diabetes mellitus and high blood pressure (0.25). There is a negative correlation between symptoms related to kidney problems and symptoms related to eye problems (−0.01). The observed correlation between diabetes mellitus and symptoms related to kidney problems is 0.09, between diabetes mellitus and symptoms related to heart/cardiovascular problems is 0.08, and between diabetes mellitus and symptoms related to eye problems is 0.13. These correlations are all positive but weak. Fig. 4 presents a pictorial representation of the correlation matrix: the darker the colour, the stronger the relationship, and vice versa.
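The accuracy quoted for the confusion matrix follows directly from the four counts given in the discussion of Table 3. The Python sketch below recomputes it, together with the TP and FP rates that an ROC curve plots; the variable names are chosen for this example only.

```python
# Counts reported for Table 3 (the diabetic class is treated as "positive").
tp = 17  # actually diabetic, predicted diabetic
tn = 51  # actually non-diabetic, predicted non-diabetic
fp = 2   # actually non-diabetic, predicted diabetic (Type I error)
fn = 0   # actually diabetic, predicted non-diabetic (Type II error)

total = tp + tn + fp + fn
accuracy = (tp + tn) / total   # share of test records classified correctly
tp_rate = tp / (tp + fn)       # sensitivity: diabetics correctly identified
fp_rate = fp / (fp + tn)       # non-diabetics wrongly flagged as diabetic

print(total, round(accuracy * 100, 1), tp_rate, round(fp_rate, 3))
```

These rates correspond to a single classification threshold; moving the threshold traces out the remaining points of the ROC curve.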

Conclusion:
This study built a valid, adequate and comprehensive model for predicting diabetes mellitus in relation to other chronic diseases. Of the fitted models, the best model, which describes the relationship between the variables with the highest precision, converged to minimal RD and AIC values of 16.476 and 30.476, respectively. Moreover, the model attained an accuracy level of 97.1% in the confusion matrix result. Furthermore, the correlation coefficient results revealed that patients with diabetes mellitus are more likely to be hypertensive than to have the other chronic diseases considered.
7. Ethical approval: All the study procedures were performed according to the Helsinki declaration, and ethical approval was obtained from the Yanshan University Ethical Board.
8. Research involving human participants and/or animals: All procedures performed in studies involving human participants were in accordance with the ethical standards of the Yanshan University research committee and with the 1964 Helsinki declaration and its later amendments or comparable ethical standards.