Usage of Kernel K-Means and DBSCAN Clustering Algorithms in Health Studies: An Application

With clustering methods, variables and individuals with similar characteristics can be collected into groups. Although clustering methods have many applications, studies in health research in our country are limited. The purpose of this study is to introduce the Kernel K-Means and DBSCAN clustering algorithms and to show how, and in which cases, they should be used correctly. At the same time, the results of different clustering algorithms applied to a real data set were compared. According to the evaluations, the Kappa coefficients were statistically significant, but their magnitude was low. For this data set, the most convenient and fastest algorithm was the Kernel K-Means clustering algorithm, and its results gave the most accurate decisions in terms of the distribution of the groups across the Framingham risk-group cross-tables. In conclusion, when clinical information is taken into account as a criterion, we believe that examining the clustering of disease risk factors will play an important role in reaching accurate diagnoses. In addition, we believe that, when the data distribution and the characteristics of the data set are considered, clustering algorithms can be used as a diagnostic tool in the planning and diagnosis of diseases in the health field.


Introduction
Today's science accepts information that is based on evidence. The substance of data emerges through calculations grounded in statistical methods and principles. As is widely known, information and technology are growing at an unavoidable pace. To obtain beneficial and useful results from such dense information, advanced and comprehensive statistical methods have become almost a necessity. Technological developments since the internet have made the proposed statistical methods faster and more practical, and using these methods helps us to understand complex information and to interpret the real world better [1][2][3].
When information on many individuals with many characteristics exists, the methods that best evaluate such situations are collected under the general title of 'Data Mining'.
As in every field, in the health field records are transferred to computers, in other words to databases. In this way, data can be kept for a long time and remains useful, and over time a great deal of information can be accessed easily. When a database is built according to a plan, it helps many new hypotheses to arise and be tested [1][2][3].
Many algorithms under the title of data mining are used for estimation, grouping, classification, and diagnosis, and they have found wide application in recent years. The main objective of many health-research hypotheses corresponds to one or several of these algorithms. Although diagnostic, classification, and estimation methods are commonly used in areas such as health, the methods developed for grouping or clustering variables and individuals are not used as commonly. This gives the impression that researchers have limited knowledge of how and where to use clustering methods and how to interpret the information obtained with their help [1][2][3].
Moreover, apart from the few clustering algorithms available in commonly used statistical software packages, the new clustering algorithms developed especially in recent years see limited use in the literature [1][2][3].

Clustering
Clustering is the act of collecting items with similar characteristics. Objects in the same cluster are similar to each other in terms of the characteristics under study, while objects in different clusters are not [1][2][3].
In other words, homogeneous groups are obtained with the help of clustering; this forms the principle of cluster analysis. The analysis is unsupervised: starting from a data set in which individuals are mixed together and no one's group membership is known, cluster analysis produces homogeneous groups, so that many objects are divided into realistic groups. With these methods, not only objects but also characteristics can be clustered; the result of this is very similar to factor analysis, since clustering many characteristics also supports dimension reduction. In practice, however, methods such as factor analysis and structural equation modelling are used to cluster characteristics, so cluster analysis is not needed for this purpose [1][2][3].
In everyday life we often cluster people, without realizing it, according to certain criteria. For example, in childhood, some characteristics are considered to easily distinguish cats from dogs and plants from animals. Accordingly, we progressively develop a clustering scheme in our subconscious. Cluster analysis is used in various disciplines in different forms and for different purposes: for example, to identify the geographic distribution of an illness, to cluster car accidents, to manage hospital staff, to handle hospital conditions, to time an ambulance service, to diagnose an illness, to identify groups of obese people, or to examine similar illnesses [1][2][3].
Cluster analysis is sometimes used as a preliminary analysis. Logistic regression is often combined with many statistical methods such as discriminant analysis and one-way analysis of variance. In such situations, homogeneous groups are first formed with the help of cluster analysis; the groups can then be analyzed in the best way with the other methods, in terms of how the groups may be separated, or a separate procedure may be carried out within each homogeneous group [1][2][3].
An effective and accurate clustering algorithm has some basic characteristics. A suitable clustering algorithm should discover cluster configurations of different shapes and sizes in a single scan, and it should be applicable to all kinds of data in terms of quality and quantity. An effective clustering method should be suitable for both large and small databases, regardless of database size; this also indicates whether the algorithm is scalable. An effective algorithm should know how to handle noisy data and should not be unduly influenced by it. Besides these criteria, an effective clustering algorithm should be easy to apply, interpretable, functional, and transparent [1][2][3].

Kernel K-Means Clustering Algorithm
Kernel K-Means is a non-linear extension of the classical K-Means method, first developed iteratively by Girolami in 2002. In this model, each input point X in the multidimensional space is mapped to Φ(X) by a non-linear transformation, and the clustering error is minimized in the transformed space. The distance between a cluster center and a data point is calculated using a kernel function in that space: the inner product Φ(X)·Φ(Y) between the images of X and Y can be calculated with the function K(X, Y).
The function K(X, Y) is a kernel function with K: D×D → R. Standard kernel functions such as the linear, polynomial, and radial basis (Gaussian) kernels can be used [4][5][6].
In the iterative process, K(X_i, X_j) must be calculated for every pair of points X_i and X_j. Thus the matrix H = [K_ij]_{n×n}, called the kernel matrix, is obtained, where n is the size of the data set. The kernel matrix is calculated and stored beforehand, so the time and space complexity is O(n^2). This is a disadvantage of the Kernel K-Means clustering method: the complexity shows that for very large databases this algorithm does not give proper results. Another disadvantage is the need to determine the number of clusters and the initial core points in advance, and the choice of kernel function strongly affects the clustering results.
To examine the operating steps of the method, assume a data set D = {X_1, X_2, ..., X_n}, let K be the required number of clusters, let M(0) be the set of initial core points, and let Π(0) denote the initial partition.
Here m_j denotes the center of cluster C_j, the average of Φ(X_l) over the points X_l in C_j, which can be calculated through knowledge of Φ(.) [4][5][6].
Because Φ(.) is accessible only through the kernel, the squared distance of X_i to the center of C_j is expanded as ||Φ(X_i) − m_j||^2 = K(X_i, X_i) − F(X_i, C_j) + G(C_j), where F(X_i, C_j) = (2/|C_j|) Σ_{X_l ∈ C_j} K(X_i, X_l) and G(C_j) = (1/|C_j|^2) Σ_{X_l ∈ C_j} Σ_{X_m ∈ C_j} K(X_l, X_m).
Step 1: for every cluster C_j, find |C_j| and G(C_j). Step 2: for every point X_i and cluster C_j, calculate F(X_i, C_j). Step 3: assign each X_i to the cluster with the smallest distance. Step 4: repeat Steps 1 to 3 until the cluster results stabilize.
The final partition is then reported as the result.
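The steps above can be sketched in a few lines of code. This is only a minimal illustration, not the authors' implementation: the Gaussian (RBF) kernel, its gamma parameter, and the simple alternating initialization are assumptions made for the example, and the F and G terms follow the distance expansion given above.

```python
import numpy as np

def rbf_kernel(X, gamma=1.0):
    """Gaussian (RBF) kernel matrix H = [K_ij], with K(x, y) = exp(-gamma * ||x - y||^2)."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-gamma * d2)

def kernel_kmeans(K, n_clusters, n_iter=100):
    """Iteratively assign points to the nearest cluster center in the transformed space."""
    n = K.shape[0]
    labels = np.arange(n) % n_clusters  # simple deterministic start; better seeding helps in practice
    diag = np.diag(K)
    for _ in range(n_iter):
        dist = np.full((n, n_clusters), np.inf)
        for c in range(n_clusters):
            mask = labels == c
            nc = mask.sum()
            if nc == 0:
                continue  # empty cluster: leave its distances at infinity
            # ||Phi(X_i) - m_c||^2 = K_ii - F(X_i, C_c) + G(C_c)
            F = 2.0 / nc * K[:, mask].sum(axis=1)
            G = K[np.ix_(mask, mask)].sum() / nc ** 2
            dist[:, c] = diag - F + G
        new_labels = dist.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # partition has stabilized
        labels = new_labels
    return labels
```

Note that building and storing the full n×n kernel matrix is exactly the O(n^2) time and space cost mentioned above, which is what limits the method on very large data sets.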

DBSCAN Clustering Algorithm
The DBSCAN clustering algorithm is a density-based algorithm, first developed by Martin Ester and colleagues. The DBSCAN algorithm is used to find clusters of arbitrary shape in noisy data. With other clustering methods it is very difficult and laborious to examine arbitrary shapes, and the computational cost is highest [4][7][8][9].
The DBSCAN algorithm requires only two input parameters, Eps and MinPts, and with these it identifies clusters in large spatial data sets by examining local density in the database [4][7][8][9]. In addition, the method decides which observations are outliers or noise.
For example, suppose there are databases like those in Figure 3: do the points really form a cluster, or is it a small cluster containing many outliers? The DBSCAN algorithm makes this decision correctly in such situations [4][7][8][9].

ε(Eps) Neighborhood
The ε-neighborhood of an object is the set of points within radius ε of it.
Eps is the proximity distance used to determine the neighbors of an object; in other words, it is the maximum distance at which two points in a cluster are considered neighbors.

MinPts
If a point is dense, then it should have at least a minimum number of points (MinPts) in its ε-neighborhood.

Core point
If an object contains at least MinPts observations in its ε-neighborhood, then it is a core point.

Density reachable
A point p is density reachable from a point q if there is a chain of points p_1, p_2, ..., p_n with p_1 = q and p_n = p such that, under the given Eps and MinPts conditions, each p_{i+1} is directly density reachable from p_i [4][7][8][9]. This can be summarized as follows. Suppose there are points p and q; p may be directly density reachable from q while q is not directly density reachable from p, because p is not a core point (assuming MinPts is three) [4][7][8][9].

Directly density reachable
A point p is directly density reachable from a point q when the following conditions hold: p is in the ε-neighborhood of q, and q is a core point [4][7][8][9].

Density connected
Under the Eps and MinPts conditions, a point p is density connected to a point q if there is a point o such that both p and q are density reachable from o. This situation is illustrated in Figures 4 and 5 [4][7][8][9].

Noise
Noise is the set of points in the database that do not belong to any cluster.
To summarize the concepts described above, the following example is given. Assuming MinPts is three, the points m, p, o, and r in Figure 6 are core points, because each has three or more points in its ε-neighborhood. Point q is directly density reachable from the core point m, and m is directly density reachable from p, so q is density reachable from p. The point a does not belong to any cluster, so a is a noise point [4][7][8][9].
In the DBSCAN algorithm there are two conditions for including a point q in a cluster K. The first is maximality: if a point p belongs to K and q is density reachable from p, then q also belongs to K. The second is connectivity: every pair of points in K must be density connected to each other. The DBSCAN algorithm is quite effective for arbitrary-shaped clusters given appropriate Eps and MinPts parameters set by the user. If a spatial index is used, the algorithm's computational complexity is O(n log n), where n is the number of observations in the data set; otherwise, the computational complexity is O(n^2) [4][7][8][9].
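The definitions above (ε-neighborhoods, core points, cluster expansion through density reachability, and noise) can be sketched as follows. This is a minimal illustration under assumed parameters, not an implementation tuned for large spatial databases: it uses the plain O(n^2) pairwise-distance approach rather than a spatial index.

```python
import numpy as np

def dbscan(X, eps, min_pts):
    """Minimal DBSCAN sketch. Returns labels: -1 = noise, otherwise a cluster id."""
    n = len(X)
    d = np.linalg.norm(X[:, None] - X[None, :], axis=2)  # all pairwise distances
    neighbors = [np.flatnonzero(d[i] <= eps) for i in range(n)]  # eps-neighborhoods (include the point itself)
    core = np.array([len(nb) >= min_pts for nb in neighbors])    # core-point test: >= MinPts neighbors
    labels = np.full(n, -1)
    cluster = 0
    for i in range(n):
        if labels[i] != -1 or not core[i]:
            continue  # already assigned, or not a core point
        labels[i] = cluster  # start a new cluster from this core point
        stack = list(neighbors[i])
        while stack:  # expand through density reachability
            j = stack.pop()
            if labels[j] == -1:
                labels[j] = cluster
                if core[j]:
                    stack.extend(neighbors[j])  # core neighbors extend the cluster further
        cluster += 1
    return labels  # points never reached stay at -1: noise
```

Border points (reachable but not core) are claimed by the first cluster that reaches them, and any point reachable from no core point remains labeled as noise, matching the noise definition above.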

Data
In our country, cardiovascular diseases have been the most common ailments in recent years. The term cardiovascular is a generic name for disorders of the heart and blood vessels; such disorders significantly deteriorate a person's quality of life and cause death. Therefore, to prevent the increase of cardiovascular problems and to reduce deaths, it is necessary to determine the risk factors and to reduce the risk of death from cardiovascular disease by making the necessary interventions. In this study, the Framingham risk score, which has been used frequently in recent years, was used to determine cardiovascular risk groups. The Framingham risk score was calculated on people living in the town of Framingham in the state of Massachusetts in the United States, and the American Heart Association created the Framingham risk scale using these data. The purpose of the scale is to decrease death rates by identifying individuals at risk of cardiovascular events so that the necessary measures can be taken.
The data used in our study cover a 3-year period between 2012 and 2014 and were collected from people who wanted to lose weight and had periodic examinations at the Duzce University Family Medicine Polyclinic. Measurements of 4788 individuals were collected. Framingham risk scores were calculated using a written macro, and the individuals were categorized into low, intermediate, and high risk groups: individuals with a Framingham risk below 10% were assigned to the low risk group, those with a risk between 10% and 20% to the intermediate risk group, and those with a risk higher than 20% to the high risk group. In the first stage of the evaluation, descriptive statistics were given as mean and standard deviation for numerical variables and as frequency and percentage for categorical variables. The variables age, gender, smoking, systolic blood pressure, total cholesterol, HDL, and use of blood pressure medication were used to calculate the risk groups with the Framingham risk score. Kernel K-Means and DBSCAN were used as the clustering algorithms. To determine which variables differed significantly between the generated clusters, one-way ANOVA was used for numerical variables, with Tukey's test for post-hoc comparisons; the relationships between categorical variables and clusters were investigated with the Pearson chi-square test. Finally, the agreement of the clustering algorithms was evaluated with Kappa statistics. The statistical significance level was 0.05, and WEKA (ver. 3.4.11), RapidMiner (ver. 6.4), and SPSS (ver. 20) were used for the analysis.
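The risk-group cut-offs described above can be expressed directly in code. This is only an illustration of the grouping rule, not the macro actually used in the study, and the handling of values falling exactly on 10% or 20% is an assumption.

```python
def framingham_group(risk_percent):
    """Map a Framingham 10-year risk (in percent) to the study's three groups.

    Cut-offs from the text: <10% low, 10-20% intermediate, >20% high.
    Exact-boundary handling (precisely 10% or 20%) is assumed here.
    """
    if risk_percent < 10:
        return "low"
    if risk_percent <= 20:
        return "intermediate"
    return "high"
```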

Results
Considering the results in Table 1, 83.5% (n=3998) of the people who participated in the study were female, 31.3% (n=1501) smoked cigarettes, and 12.1% (n=579) used blood pressure medication. According to the Framingham risk score, 4.2% (n=203) were identified as the high risk group.
Descriptive values of all numerical variables are given in Table 2 as mean, standard deviation, minimum, and maximum.
Considering age, gender, smoking status, systolic blood pressure, total cholesterol, HDL, and blood pressure medication use in the data set, the Kernel K-Means and DBSCAN implementations were carried out in 5 steps. For Kernel K-Means, the proportion of the high risk group was 19.6% (Table 3). The proportion of individuals classified as high risk by the DBSCAN method was considerably larger, and the proportion of low risk individuals identified by DBSCAN was also very large. When forming the clusters, the DBSCAN method excluded 380 people as extreme points. Table 4 shows the comparisons of mean age, systolic blood pressure, total cholesterol, and HDL between the Kernel K-Means and DBSCAN cluster groups, together with p values. Among the clusters obtained with the two methods, there were significant differences in the numerical variables. For both Kernel K-Means and DBSCAN, the high risk group's age and total cholesterol were significantly higher than in the other cluster groups, but in the DBSCAN method the intermediate group's systolic blood pressure was significantly lower than in the low and high risk groups. When the results were examined by HDL, the Kernel K-Means and DBSCAN results were very different: in the Kernel K-Means algorithm the high risk group's mean HDL was significantly higher than in the other cluster groups, while in the DBSCAN method the low risk group had the significantly higher mean HDL. This showed that the results of the two methods were not the same.
The relationships between the categorical variables used in calculating the Framingham risk score and the cluster results obtained with Kernel K-Means and DBSCAN are compared in Table 5.
Considering Table 5, for the Kernel K-Means method there were significant differences in terms of gender, smoking status, and blood pressure medication use, but in the DBSCAN method blood pressure medication use was not a risk factor for cardiovascular disease.
The agreement of the DBSCAN and Kernel K-Means clustering methods with the Framingham risk score was then evaluated.
The agreement of each of the clustering results with the risk groups based on the Framingham risk score was investigated with Kappa statistics, and the results are shown collectively in Table 6. According to Table 6, there was significant agreement between the Kernel K-Means and DBSCAN clustering algorithms, but the values of the Kappa statistics were quite small.
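The Kappa statistic used here measures agreement between two categorical labelings beyond what chance alone would produce. A minimal sketch of its computation follows; the example labels are made-up, not the study data.

```python
import numpy as np

def cohen_kappa(a, b):
    """Cohen's kappa: (p_o - p_e) / (1 - p_e), observed vs. chance agreement."""
    a, b = np.asarray(a), np.asarray(b)
    p_o = np.mean(a == b)  # observed proportion of agreement
    # chance agreement from the marginal category frequencies of each labeling
    p_e = sum(np.mean(a == c) * np.mean(b == c) for c in np.union1d(a, b))
    return (p_o - p_e) / (1 - p_e)
```

A kappa near zero indicates little agreement beyond chance even when raw agreement looks high, which is why the small Kappa values in Table 6 matter despite being statistically significant.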
The relationships between the Framingham risk score groups and the clusters obtained from the two clustering algorithms are given in Table 7. Considering Table 7, the DBSCAN algorithm's results were not in significant agreement with the Framingham risk score,
but the Kernel K-Means algorithm's results were in significant agreement with the Framingham score.

Discussion and conclusion
Clustering algorithms are used in a variety of fields: for example, clustering texts on social networks such as Facebook and Twitter, grouping students according to social skills at school, classifying various diseases or disease symptoms, developing disease groups from combined laboratory and clinical findings, identifying epidemics, regional analysis, identifying different morphologies of heart sounds, and classifying physiological status. At the same time, clustering algorithms are often preferred as a pre-analysis to detect homogeneous groups before many statistical tests [10][11][12][13][14].
Based on the types of variables and the presence or absence of outlier/noise observations in a data set, a variety of clustering algorithms have been developed. In particular, the number of clustering algorithms has increased in the last 10 years, which indicates an increased need for clustering [10][11][12][13][14][15][16][17][18][19][20][21][22].
The Framingham risk score, which we used in our study, has been used in many international and national studies to predict cardiovascular risk, and has appeared in over 1200 articles for determining the turning point of cardiovascular disease. However, only a limited number of risk factors enter the model when the risk score is calculated: age, gender, total cholesterol, systolic blood pressure, smoking status, and drug use status. The Framingham risk score is a mathematical model that predicts the risk of cardiovascular disease over 10 years, but the method has disadvantages. It predicts only cardiovascular disease risk, not other heart illnesses, and the model applies only to individuals with no previous heart disease. Another disadvantage is that young people were under-represented in the cohort study, so the score places intensive emphasis on age. The Framingham risk data were developed from people living in the Framingham area; because this region contains mostly white and homogeneous groups, the score may not be accurate for different ethnic and cultural structures. Moreover, the Framingham risk score provides only a 10-year predictive model; the effects of the included risk factors may become stronger over time, but the score neglects this effect. For these reasons, the Framingham risk score should not be used as a gold-standard test for estimating cardiovascular risk in our country, although it is useful as a preliminary assessment for studies [23].
In this direction, we investigated the risk factors of cardiovascular disease, which causes the most deaths among all diseases. Using the risk factors, we clustered people into three groups: low risk, intermediate risk, and high risk. We then compared the groups obtained from the Framingham risk score with the DBSCAN and Kernel K-Means results.
The Kernel K-Means clustering algorithm produced more consistent results than the DBSCAN method, owing to the small number of categorical variables and the large size of our data set. Although Kernel K-Means is not a convenient method for clustering very large data sets, it gives correct results when the clustering is applied with accurate parameters, and it produced results faster on the data set used. The DBSCAN algorithm was more successful at finding low risk individuals, while the Kernel K-Means method tended to find those at high risk.
In this research, identifying high cardiovascular risk is the clinically more valuable outcome, so a high proportion in that cell of the cross-table is desired; in some other studies, finding low risk individuals may instead count as success. In such cases, the results should be analyzed with this in mind. With these data sets, researchers make a mistake if they assume that an algorithm gives definitive results. For these reasons, clustering algorithms should be reviewed together with all clinical information, evaluating their criteria, assumptions, algorithm conditions, disadvantages, and advantages. Statistical methods should be supported by the clinical findings, so that the application stage continues easily and correctly behind the scenes [10][11][12][13][14].
As a result, we propose increasing the use of clustering algorithms in the health field. If the correct method is used in developing health policies, the necessary precautions can be taken against the determined disease risks. This will lead to an improved quality of life for our people and an increase in average life expectancy. Therefore, even a simple clustering algorithm may lead to many different developments in the field of health and medicine [10][11][12][13][14].