Real-Data Comparison of Data Mining Methods in Early Detection of Chronic Obstructive Pulmonary Disease (COPD) in General Practice

ClinMed International Library
ISSN: 2469-5793
Citation: Rodríguez-Álvarez C, Félix R, González-Dávila E, González-Martín I, Beatriz C, et al. (2016) Real-Data Comparison of Data Mining Methods in Early Detection of Chronic Obstructive Pulmonary Disease (COPD) in General Practice. J Fam Med Dis Prev 2:045
Received: September 28, 2016; Accepted: November 03, 2016; Published: November 07, 2016
Copyright: © 2016 Rodríguez-Álvarez C, et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Rodríguez-Álvarez et al. J Fam Med Dis Prev 2016, 2:045


Introduction
Chronic obstructive pulmonary disease (COPD) is one of the major causes of morbidity and mortality in developed countries [1]. Although the disease is closely tied to tobacco consumption and developed countries have adopted important prevention campaigns, its prevalence and mortality continue to increase worldwide. According to clinical forecasts, in 2020 it will be the fifth leading cause of disease and the third leading cause of mortality worldwide. For these reasons, the disease has received increasing medical attention in recent years; nevertheless, it is still relatively ignored by the population, public health authorities and governments [2].
COPD has a significant impact on patients' quality of life and on the costs borne by the health system. Severe COPD is the most common condition requiring hospitalization and contributes substantially to the related economic impact. This includes the high cost of all prescription medicines, general practice care, emergency room visits and hospitalization episodes [3].
COPD is a complex, chronic and progressive disease characterized by chronic inflammation and irreversible airflow obstruction, which involves structural changes in the lung. The principal symptoms are difficulty in breathing, cough and expectoration. Its clinical presentation comprises different, very heterogeneous phenotypes, with prognostic and therapeutic repercussions [4].
Although COPD is not curable, stopping smoking is the most effective measure to prevent it and to halt its progression. A clinical diagnosis of COPD should be considered in every patient with respiratory difficulty, chronic cough or high production of secretions and a history of exposure to risk factors for the disease [5]. Different authors identify the following as significant factors related to this condition: male sex, age, tobacco consumption (pack-years), cough, expectoration, difficulty in breathing and other respiratory symptoms [6][7][8][9][10]. Several studies indicate that good control of COPD requires diagnosis in the first stages of the disease, as well as more appropriate preventive measures, systematic control and good follow-up [2,11].
Therefore, early detection and diagnosis of COPD play an important role in effective prevention strategies. Nevertheless, no general model has been widely adopted in primary healthcare for this purpose. It is therefore important to provide an efficient and precise model to predict the population at risk of COPD and to identify undiagnosed people who require diagnosis by spirometry.
Certain authors address the essential issue of how to implement a selection model to diagnose COPD in its early stages [7,12]. Dirven, et al. [8,9] indicate that questionnaires could be used in primary care to detect respiratory health problems and, depending on their results, to subsequently prescribe spirometry leading to an accurate diagnosis. Clinicians and health service researchers are frequently interested in predicting patients' specific probabilities of adverse events (e.g. death, disease recurrence, post-operative complications and hospital readmission) [13]. Data mining has helped to predict under-diagnosed patients, as well as to identify and classify at-risk people in terms of health [14][15][16]. The aim of this study was to determine which factors discriminate between people with and without COPD, in order to provide a simple tool that can be used by primary health care personnel. To this end, we applied five methods commonly used in data mining: two decision tree methods, two decision rule methods and one decision function method. The results enable high-efficiency prediction of COPD using a reduced number of factors which may be easily employed in primary health care.

Design
To meet the objectives of this work, a cross-sectional epidemiological study was conducted on the island of Tenerife (Spain) from September 1, 2011 to December 31, 2012. It included smokers of both sexes between 40 and 69 years of age.
We selected one third of the 37 health centres in Tenerife, stratified by the four geographical health areas (Metropolitan, North, Southeast and Southwest), with allocation proportional to the number of centres in each area (16, 12, 4 and 5, respectively). Thus, a total of 12 centres were selected. The necessary permissions were obtained, as well as the collaboration of family doctors, nurses and staff from the centres. All centres had a Datospir 120 spirometer (Sibel S.A.), which is the model available in primary care centres in Tenerife.
The total number of participants, 2,163, was determined using the total population between 40 and 69 years of age in Tenerife (Continuous Register 2010, 348,844 inhabitants), an upper-limit proportion of COPD in smokers of 15%, a significance level of 5% and a precision of 1.5% in the estimate. These participants were randomly selected in the 12 included centres, with proportional allocation based on the number of patients assigned to each centre.
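The sample size quoted above can be reproduced with a short calculation. The sketch below assumes the standard formula for estimating a proportion with a finite population correction and a normal quantile of 1.96 for the 5% significance level; the article does not state which formula was used, and the function name is hypothetical.

```python
def sample_size_for_proportion(population, proportion, margin, z=1.96):
    """Sample size to estimate a proportion, with finite population correction.

    Assumption: the usual normal-approximation formula; the source does not
    name the formula that produced the figure of 2,163.
    """
    # Sample size for an effectively infinite population.
    n0 = z ** 2 * proportion * (1 - proportion) / margin ** 2
    # Correct for the finite size of the target population.
    return n0 / (1 + (n0 - 1) / population)


# Inputs taken from the text: 348,844 inhabitants, 15% proportion, 1.5% precision.
required = sample_size_for_proportion(348_844, 0.15, 0.015)
print(round(required))
```

Under these assumptions the result rounds to the number of participants given in the text.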
The inclusion criteria were: being between 40 and 69 years of age, having a positive history of smoking (current or ex-smoker), having no previous respiratory diagnostic test, and being willing to collaborate and subsequently sign the informed consent. An appointment at the health centre was then arranged with these people. The exclusion criteria were: being outside the age range, being neither a current nor a former smoker, and having a previous COPD diagnosis.
The appointments of the individuals were made by sending them a personalized letter containing a brief overview of the study and indicating they had been randomly selected.They were invited to participate in the study and informed that they would be telephoned in the coming days.
Affiliation data, age, sex and the responses to the European Coal and Steel Community (ECSC) questionnaire on respiratory symptoms, translated and validated in Spain [17], were collected from participants. This questionnaire includes different sections related to respiratory disease, such as the existence of cough and expectoration, dyspnoea, wheeze and chest tightness, among others. A smoking index in pack-years, calculated from the number of cigarettes smoked per day and the number of years the person had smoked, was also collected.
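As an illustration of the smoking index, the following sketch computes pack-years from the two quantities collected. It assumes the conventional definition of one pack as 20 cigarettes, which the article does not state explicitly; the function name is hypothetical.

```python
def pack_years(cigarettes_per_day, years_smoked, cigarettes_per_pack=20):
    """Smoking index: packs smoked per day multiplied by years of smoking.

    Assumption: a pack holds 20 cigarettes (the usual convention).
    """
    if cigarettes_per_day < 0 or years_smoked < 0:
        raise ValueError("inputs must be non-negative")
    return (cigarettes_per_day / cigarettes_per_pack) * years_smoked


# One pack a day for 10 years, and half a pack a day for 30 years.
print(pack_years(20, 10))
print(pack_years(10, 30))
```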
All participants were instructed beforehand about the test to be performed. Three spirometry manoeuvres were performed on each individual by a single skilled person, following the recommendations of the American Thoracic Society and always using the same type of spirometer mentioned above.
Lung function measurements included forced expiratory volume in 1 second (FEV1), forced vital capacity (FVC) and their ratio (FEV1/FVC). FEV1 and FVC were expressed in litres and as the percentage relative to the reference values for the Spanish population. According to the Spanish COPD guidelines and as proposed elsewhere for mass screening programs, we used pre-bronchodilator lung function to classify airflow limitation, defined by an FEV1/FVC ratio < 0.70 [18].
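The fixed-ratio criterion described above can be expressed as a one-line check. This is a minimal sketch with a hypothetical function name; the example values are illustrative and are not taken from the study.

```python
def has_airflow_limitation(fev1_litres, fvc_litres, threshold=0.70):
    """Return True when the FEV1/FVC ratio is below the fixed threshold."""
    if fvc_litres <= 0:
        raise ValueError("FVC must be positive")
    return fev1_litres / fvc_litres < threshold


# Illustrative values only: ratios of 0.625 and about 0.79.
print(has_airflow_limitation(2.0, 3.2))
print(has_airflow_limitation(3.0, 3.8))
```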

Data analysis and experimental configuration
All statistical analyses and prediction models were conducted in SPSS 21 for Windows (IBM SPSS Statistics, Chicago, IL, USA) and Weka 3.6.3 (Waikato Environment for Knowledge Analysis, GNU-GPL). No missing data occurred because the researcher was present when the information was collected. Different data mining methods were tested with the aim of obtaining a good model for predicting COPD. In particular, we used two decision tree methods, J48 (the Weka version of C4.5) and CART; two decision rule methods, JRip and PART; and one decision function method, logistic regression (LR) [19]. To increase predictive quality, we first applied the wrapper-based approach to variable selection [20]. This allowed us to find a quasi-optimal set of variables for the data mining method that would then be applied. In all cases, a Genetic Search algorithm was used. Other search algorithms, such as Best-First, Greedy Stepwise, Random or Exhaustive, were discarded because they produced a very small set of variables or required excessive computation time.
Area under the receiver operating characteristic curve (AUROC), sensitivity, 1-specificity (false positive rate), F-measure and Cohen's Kappa are reported to assess the efficiency of the selected models. These statistics are shown both when using the total sample as the training set and after evaluating the model with 10-fold cross validation. Since in these studies the sample contains many more non-COPD than COPD individuals and the confirmatory test for the disease is not too expensive, we used sensitivity as the primary criterion for comparing predictive power.
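To make the 10-fold cross validation concrete for an imbalanced sample, the sketch below assigns individuals to folds while keeping the class proportions similar in each fold. It assumes stratified folds, which the article does not specify, and uses the class counts reported in the Results (45 COPD and 357 non-COPD). The function name and the seed are hypothetical.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=10, seed=0):
    """Split sample indices into k folds with similar class proportions.

    Assumption: stratification by class label; this is a sketch, not the
    procedure used by the authors' software.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled indices of each class round-robin across folds.
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds


# 1 = COPD, 0 = non-COPD, using the counts reported in the Results.
labels = [1] * 45 + [0] * 357
folds = stratified_folds(labels)
for fold in folds:
    print(len(fold), sum(labels[i] for i in fold))
```

Each fold serves once as the test set while the remaining nine are used for training, so every individual is classified exactly once by a model that did not see them.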

Results
Of the 2,163 individuals previously selected, 18.7% (24.4% of men and 13.6% of women) met the inclusion criteria and went to their health centres on the agreed day (Figure 1). The sample was finally composed of 265 men and 147 women.
Of the 402 individuals analysed, 45 (11.2%) were diagnosed with COPD. The percentages of non-COPD and COPD for each variable considered in this study are presented in table 1. The predictive power of individual characteristics for COPD, measured as AUROC and odds ratio, is also shown in that table. In particular, 14.6% of men had COPD compared with 5.3% of women (p = 0.005). The percentage of COPD increased with age group, from 3.2% in those younger than 50 to 21.9% in those over 60 (p < 0.001). Similarly, the percentage of COPD increased with smoking intensity (pack-years), from 1.1% in the group with < 15 pack-years to 19.7% in the group with more than 30 pack-years (p < 0.001). Of the items listed in the ECSC questionnaire, those relating to dyspnoea, cough, wheeze and phlegm stand out, among others. For example, 81.0%,
The best indicator of the risk of COPD is dyspnoea when climbing one flight of steps (OR = 249.05; AUROC = 0.94), followed by dyspnoea when walking on level ground, cough daily, cough in the morning and wheeze, all with an AUROC > 0.7.
Table 2 shows the resulting predictive models for COPD when combining the different characteristics observed in the five data mining methods used.As expected, all include dyspnoea when climbing one flight of steps.These models are obtained after the application of the wrapper-based approach in the variable selection process.The selected variables are listed in table 3. The number of variables included in the final models ranged from 3 in the J48 decision tree to 6 in the PART decision list and logistic regression, with 5 in the case of the CART decision tree and JRip rule.
As an example of using table 2, consider an individual 40 years of age who has dyspnoea when climbing one flight of steps and a family history of asthma. Applying the CART decision tree, we classify the individual as non-COPD since "Dyspnoea when climbing one flight of steps = Yes", "Age group = < 50 years" and "Asthma family history = Yes". In particular, for the sample used, after applying the 10-fold cross validation, the 6 people with these features were classified correctly, as shown in brackets (6/0) in that table. Applying the J48 decision tree, we also classify him as non-COPD since "Dyspnoea when climbing one flight of steps = Yes" and "Age group = < 50 years". In this case, of the 21 people in the sample who met the conditions, 16 were classified correctly and 5 incorrectly, as shown in brackets (16/5). To illustrate the logistic regression method, further suppose that the individual has no dyspnoea when walking, no cardiac disease, no waking up with a sensation of drowning and a smoking intensity of 10 pack-years. Applying the equation of table 2 gives Logit = -5.09 + 5.15 - 1.70 = -1.65 or, equivalently, P(COPD) = 1/(1 + e^{1.65}) ≈ 0.16, implying that the individual would be classified as non-COPD, with a probability of COPD of only 16%.
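The conversion from the logit in the worked logistic regression example to a probability can be checked numerically. The sketch below uses only the three terms quoted in the text; the full equation is in table 2, which is not reproduced here, so the meaning attached to each term is not assumed. Note that the three rounded terms sum to -1.64 while the text reports -1.65, presumably from unrounded coefficients; both values give a probability of about 16%.

```python
import math


def logistic_probability(logit):
    """Convert a logit (log-odds) into a probability."""
    return 1.0 / (1.0 + math.exp(-logit))


# The three terms quoted in the worked example.
terms = [-5.09, 5.15, -1.70]
logit_from_terms = sum(terms)

probability_from_terms = logistic_probability(logit_from_terms)
probability_reported = logistic_probability(-1.65)

# Assumption: the usual 0.5 probability cut-off for classification.
predicted_class = "COPD" if probability_reported >= 0.5 else "non-COPD"
print(round(probability_from_terms, 2), round(probability_reported, 2), predicted_class)
```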
The predictive power of the five proposed methods is shown in table 4. The model validation is provided both on the total dataset and on the 10-fold cross validation.The values of sensitivity, false positive rate, F-measure, Kappa's coefficient and AUROC are shown in table 4a while the number of people classified as COPD and non-COPD based on actual values observed in the sample are shown in table 4b.
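The coefficients in table 4a follow directly from the counts in table 4b. As a hedged illustration, the sketch below derives sensitivity, false positive rate, F-measure and Cohen's Kappa from a 2x2 confusion matrix, using the JRip counts quoted in the text for the 10-fold cross validation (41 of 45 COPD cases detected and 17 of 357 non-COPD individuals flagged as COPD). The function name is hypothetical, and AUROC is omitted because it requires the predicted scores rather than the counts.

```python
def classification_metrics(tp, fn, fp, tn):
    """Derive summary metrics from the four cells of a confusion matrix."""
    total = tp + fn + fp + tn
    sensitivity = tp / (tp + fn)
    false_positive_rate = fp / (fp + tn)
    precision = tp / (tp + fp)
    f_measure = 2 * precision * sensitivity / (precision + sensitivity)
    # Cohen's Kappa: observed agreement corrected for chance agreement.
    observed = (tp + tn) / total
    expected = ((tp + fn) * (tp + fp) + (fp + tn) * (fn + tn)) / total ** 2
    kappa = (observed - expected) / (1 - expected)
    return {
        "sensitivity": sensitivity,
        "false_positive_rate": false_positive_rate,
        "f_measure": f_measure,
        "kappa": kappa,
    }


# JRip counts quoted in the text: 45 COPD (41 detected), 357 non-COPD (17 flagged).
jrip = classification_metrics(tp=41, fn=4, fp=17, tn=357 - 17)
for name, value in jrip.items():
    print(name, round(value, 3))
```

The sensitivity and false positive rate obtained this way agree with the values reported for the JRip rule (0.911 and 0.048).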

Discussion
This study shows that a simple tool consisting of symptom-based questions can be very useful in the identification of COPD patients with a smoking history in primary care.The results enable high efficiency prediction of COPD using a reduced number of factors which may be easily employed in the field of primary health care.
COPD is highly prevalent in smokers, and many authors agree that it is paramount to anticipate and improve diagnosis in primary care [6,8,9,21]. In our study, 11.2% of the smoking participants were found to have COPD, none of whom had previously been diagnosed with respiratory disease. We thus consider that a simple questionnaire could be an important tool for achieving early diagnosis in primary care and reducing under-diagnosis. Other similar studies report higher percentages of underdiagnosed individuals in primary health care [6,7]. However, Gingter, et al. [22] reported smaller values than ours.
There is controversy regarding the best way to face the problem of under-diagnosis and late diagnosis of COPD.The administration of questionnaires to the general population (active search), as well as to population consulting for any cause (opportunistic search) allows us to select a population with a higher risk of COPD and improves diagnostic performance of spirometry [23,24].
Using the results obtained in the 10-fold cross validation, all models have high sensitivity and AUROC. The PART decision list presents the worst values. The JRip rule has the best sensitivity (0.911) and the second best AUROC (0.932), exceeded only by logistic regression with an AUROC of 0.965. Of the 45 COPD cases present in the sample, 41 are correctly classified by the JRip rule, while 17 non-COPD individuals are wrongly classified as COPD. Thus, this method has the worst false positive rate (0.048). The J48 decision tree has the best false positive rate (0.02), which contributes to its having the best F-measure (0.818) and Kappa coefficient (0.7958). We also found that, of the 357 non-COPD individuals, only 7 are misclassified.
All methods have proved capable of discriminating individuals with or without COPD. However, considering that having COPD is the key prediction in this biomedical application, a classification method with higher sensitivity is desired. We consider the JRip rule to be the most effective discriminatory tool for deciding when to order spirometry. We selected this method because it has the highest sensitivity and one of the best AUROC values of the methods tested when evaluated by means of a 10-fold cross validation. In addition, this model selects only five variables: dyspnoea when climbing one flight of steps, dyspnoea when walking, phlegm daily for three months, smoking intensity and cough daily for three months.
We propose a simple predictive model with a series of easily obtained items which have proven capable of identifying patients with or without COPD, so it could be used at this level of health care to [...] but showed a negative association with COPD in the final model and chronic phlegm, while strongly associated with obstruction, identified less than 1% of those with a study diagnosis of COPD.
Of the three items related to cough, cough daily for three months is the only one that has been included in the final model. Price, et al. [6] indicate that cough is the most prevalent symptom in smokers, with or without COPD, so it does not present high discriminatory power. Freeman, et al. [7] include coughing occasionally or more often as a symptom with high discriminatory power to diagnose COPD.
The intensity of tobacco consumption expressed in pack-years is another item included in the model. Most authors agree that the main risk factor for this disease is the intensity of tobacco consumption [25], and it appears in the screening models proposed by different authors [6][7][8][9][26], who also include age as a predictive factor. In our study, although the presence of COPD increases significantly with age, age does not appear as a selected item in the JRip model.

Limitations of the study
Because participation was voluntary, the participants who joined the study may not reflect the general primary care population, and the COPD prevalence obtained may only be a rough estimate, although we consider that it may be a good indicator if screening takes place in this environment. Given the prevalence of COPD in the population, the selection criteria for the study make it necessary to work with very large samples, which is not always possible given the high rate of failure to attend scheduled appointments.
Development of such statistical tools will require an additional study, including prospective validation of items in an appropriate clinical setting and policy recommendations on the use of these predictor factors.

Conclusion
Our data confirm the presence of a high number of smokers with respiratory symptoms who are not diagnosed with COPD.
We propose a simple predictive model, with a number of easily obtained items that have demonstrated the capacity to identify patients with and without COPD. The five tested models work acceptably and, although no single method is always the best for classifying different datasets under different criteria, the JRip decision rule was chosen because it presents the best sensitivity and one of the best AUROC values, while maintaining a low false positive rate. Performance characteristics suggest that our questionnaire could be very useful in primary healthcare to enhance the efficiency and diagnostic accuracy of current screening efforts that use spirometry alone, and thus to improve the accuracy of early diagnosis of COPD in smokers with respiratory symptoms.

Table 1: Percentage of COPD and non-COPD in each of the population characteristics.

Table 4: Evaluation results of the predictive models on the total dataset and after applying the 10-fold cross validation: a) coefficients; b) number of people classified.