Detection of lung cancer with electronic nose and logistic regression analysis

Lung cancer is a very common malignancy with a low five-year survival rate. The artificial olfactory sensor (electronic nose) has recently been studied as a potential screening tool for early detection of lung cancer, but no statistical method has yet been put forward as the preferred one. The aim of the study was to explore the use of logistic regression analysis (LRA) of patients’ exhaled breath samples, measured with an electronic nose, to differentiate lung cancer patients (regardless of the stage of the cancer) from patients with other lung diseases and healthy individuals. Patients with histologically or cytologically verified, untreated lung cancer, patients with other lung diseases such as benign lung tumors, chronic obstructive pulmonary disease, asthma, pneumonia, etc, and healthy volunteers were enrolled in the study: in total 252 cancer patients and 223 individuals without cancer. Breath samples were collected and analysed with the Cyranose 320 sensor device, and the data were further analysed using LRA. The LRA correctly differentiated lung cancer patients from no-cancer patients. The overall sensitivity in detecting patients having cancer was 95.8% for smokers and 96.2% for non-smokers, and the overall specificity was 92.3% for smokers and 90.6% for non-smokers. Exhaled breath analysis by electronic nose using LRA is able to discriminate lung cancer patients from patients with other lung diseases and from healthy individuals.


Introduction
Lung cancer is one of the leading malignant diseases in the world. According to the World Cancer Report 2014, data from 2012 show that 1.8 million lung cancer cases were detected worldwide (13% of all cases of cancer) [1]. Unfortunately, the mortality from lung cancer is high and the overall five-year survival rate is low, only 17% [1,2]. These numbers arise from the fact that in its early stages lung cancer usually evolves with no significant symptoms and with no suspicious radiological changes. Thus lung cancer is mostly detected at an already advanced stage and is consequently linked with higher lethality. The survival rate for lung cancer increases if the cancer is diagnosed at an early stage. Only ∼15% of lung cancer patients are diagnosed at an early stage, and more than half of lung cancer patients die within a year of the detection of lung cancer [2].
None of the available diagnostic methods (computed tomography/positron emission tomography, fibrobronchoscopy with biopsy, or sputum cytology) has been accredited as a reliable screening method, since each has its disadvantages. A simple, cheap and widely available tool for early diagnostics of lung cancer is lacking, and it is essential to search for one. One such potential tool is the electronic smell sensor or electronic nose (e-nose). It has been tested in various specialities and divisions of medicine, not only in respiratory medicine, for more than two decades. Several data analysis methods have been applied to the data gained with the electronic nose. In respiratory medicine the e-nose, combined with various statistical analysis methods, has mostly been used in studies exploring differentiation between chronic obstructive pulmonary disease (COPD), asthma, pneumonia and lung cancer. The results obtained have been promising. Still, no statistical data analysis method has been found to be superior to the others, yet the results may vary significantly depending on the method chosen. In a retrospective study that reviewed 73 studies on the analysis of volatile organic compounds (VOC), numerous different statistical data analysis methods were identified [3], while the recently published European Respiratory Society (ERS) Task Force publication on biomarkers in lung disease gives no current recommendations regarding the statistical analysis of VOC [4]. Regarding the minimum reporting standards for data analysis in metabolomics, there is no preferred algorithm and the preferred method in each case should depend on the problem [5]. The ERS Task Force recommends using principal component analysis as a clustering method for exploratory data analysis [4]. Due to this ambiguity, we chose logistic regression analysis (LRA) for our study, to explore the possibility of using LRA as a statistical tool for differentiating lung cancer patients from patients with other lung diseases and from healthy individuals by analysing the data obtained from patients' exhaled breath with the artificial olfactory sensor (electronic nose).

Study patients
The patients and healthy volunteers were enrolled in a study at the Department of Lung Diseases and Centre of Thoracic Surgery at Pauls Stradins Clinical University Hospital in Riga, Latvia, from April 2011 to September 2013. Approval from the Ethical committee of scientific research at the University of Latvia, Institute of Experimental and clinical medicine, was received. All patients signed an informed consent.
The patients were divided into 'cancer' and 'no-cancer' groups and then further into 'smokers' and 'non-smokers' groups (see below). In the 'cancer' group we included patients with verified lung cancer. In the 'no-cancer' group we included patients with other lung diseases, as well as healthy volunteers.
For the 'cancer' group patients the clinical diagnosis of lung cancer was specified [6]. Patients with complications of lung cancer, such as post-obstructive atelectasis, pneumonia, carcinomatous lymphangitis, destruction of the tumor mass, etc were also included [7]. All lung cancer patients were newly diagnosed and had not received any specific anti-cancer therapy before the breath sampling.
In the 'no-cancer' group we included both healthy volunteers and:
• patients with known COPD, diagnosed according to the Global Initiative for Chronic Obstructive Lung Disease (GOLD) Report, updated 2011,
• patients with bronchial asthma, diagnosed according to the Global Initiative for Asthma (GINA) Report, updated 2011,
• patients with histologically or cytologically verified benign lung tumors [7].
Exclusion criteria (for all patient groups):
• patients who were unable to perform the manoeuvres necessary to obtain the breath sample;
• patients with uncertain anamnesis regarding possible lung diseases;
• patients with no clear diagnosis despite thorough investigations [7].
In total, 475 individuals were included in the study, 336 males and 139 females. After classifying them as 'cancer' or 'no-cancer', we divided them into two further groups. The first group ('non-smokers') included non-smokers and ex-smokers, 265 individuals in total. Ex-smokers had to have ceased smoking at least a year earlier. The second group ('smokers') included only smokers, 210 individuals in total. The 'non-smokers' group consisted of 133 patients with verified lung cancer (from the 'cancer' group) and 132 individuals from the 'no-cancer' group. The 'smokers' group consisted of 119 patients from the 'cancer' group and 91 individuals from the 'no-cancer' group.
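The group sizes reported above can be cross-checked with a few sums. The snippet below is only an arithmetic sketch using the counts stated in the text; the dictionary layout and the helper name `subtotal` are illustrative choices, not part of the study.

```python
# Group sizes as reported in the text ('non-smokers' include ex-smokers).
groups = {
    ("cancer", "non-smokers"): 133,
    ("no-cancer", "non-smokers"): 132,
    ("cancer", "smokers"): 119,
    ("no-cancer", "smokers"): 91,
}


def subtotal(counts, position, label):
    """Sum the counts whose key carries `label` at index `position`."""
    return sum(n for key, n in counts.items() if key[position] == label)


totals = {
    "cancer": subtotal(groups, 0, "cancer"),
    "no-cancer": subtotal(groups, 0, "no-cancer"),
    "non-smokers": subtotal(groups, 1, "non-smokers"),
    "smokers": subtotal(groups, 1, "smokers"),
    "all": sum(groups.values()),
}
print(totals)
```

The subtotals agree with the figures given in the text: 252 'cancer' and 223 'no-cancer' individuals, 265 'non-smokers' and 210 'smokers', 475 in total (which also equals 336 males plus 139 females).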
The division of lung cancer patients according to the histological forms of the cancer type and according to the cancer stage is depicted below in figures 1 and 2.
The division of no-cancer individuals into disease groups is represented in figure 3.
Study subjects had to fill in a questionnaire, which contained questions about demographics, concomitant diseases and smoking history. After filling in the questionnaire, a breath sample was collected [7]. The study design is depicted in figure 4.

Exhaled breath sampling and analysis
Sampling and analysis of exhaled breath were done according to a standardized method published by Dragonieri, with some modifications [8].
Initially, with the nose clipped, patients breathed tidally for 5 min through a T-shaped two-way non-rebreathing valve (Hans Rudolph Inc., USA), inhaling air filtered through activated carbon (Nordic Safety, Norway), in order to clear the exhaled breath of ambient air pollution. The second step was inhalation to total lung capacity followed by full exhalation into a polyethylene terephthalate bag. This was followed by immediate analysis with the artificial olfactory sensor device [7]. The approximate expiratory flow rate was 250-500 ml s−1; we did not separate the dead space sample. The analysis model in LRA was later designed specifically for this sample acquisition method.
Exhaled breath analysis was done within 5 min after its collection with the artificial olfactory sensor device Cyranose 320 (Smith's Detection, USA). The cycle of exhaled breath sample analysis consisted of a 20 s long period of baseline ambient air registration, a 60 s long period of analysis of the sample, a 5 s long interim period, when the air sample was disconnected from the olfactory sensor, and a 180 s long rinsing cycle. During the analysis 32 sensor curves of electrical resistance were registered [7].
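The measurement cycle described above can be written down as a simple timeline. This is a minimal sketch; the phase durations come from the text, while the `Phase` tuple and the function `phase_windows` are hypothetical names introduced only for illustration.

```python
from collections import namedtuple

Phase = namedtuple("Phase", ["name", "duration_s"])

# Phase durations as stated in the text, in order of execution.
CYCLE = [
    Phase("baseline (ambient air)", 20),
    Phase("sample analysis", 60),
    Phase("interim (sample disconnected)", 5),
    Phase("rinsing", 180),
]


def phase_windows(cycle):
    """Return (name, start_s, end_s) for each phase, in order."""
    windows, start = [], 0
    for phase in cycle:
        end = start + phase.duration_s
        windows.append((phase.name, start, end))
        start = end
    return windows


windows = phase_windows(CYCLE)
total_s = windows[-1][2]  # end of the last phase = length of one cycle
print(windows, total_s)
```

Under this layout one full cycle lasts 265 s, and the 60 s sample window runs from 20 s to 80 s after the start of the cycle.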

Statistical analysis of the data
The data derived were statistically analysed using LRA (Statistica 7.0). As continuous predictors, we chose the relative maximum (Rmax), the area under the curve (AUC 0-60') and tgα 0-60' for each of the 32 sensor curves. Age, smoking status (smoker, ex-smoker, non-smoker), smoking history and ambient temperature (t°C) at the moment of taking the air sample were considered additional predictor factors.
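To make the three curve-derived predictors concrete, the sketch below computes them for one sensor curve. The exact definitions used in the study are not given in the text, so everything here is an assumption: the response is taken relative to a baseline resistance, the area is computed with the trapezoidal rule, and tgα is taken as the slope between the first and last point of the sampling window. All function names are hypothetical.

```python
def relative_response(resistance, baseline):
    """Resistance change relative to the baseline value (assumed definition)."""
    return [(r - baseline) / baseline for r in resistance]


def curve_features(times_s, resistance, baseline):
    """Three summary features of one sensor curve over the sampling window."""
    rel = relative_response(resistance, baseline)
    r_max = max(rel)
    # Area under the relative-response curve, trapezoidal rule.
    auc = sum(
        (times_s[i + 1] - times_s[i]) * (rel[i] + rel[i + 1]) / 2
        for i in range(len(rel) - 1)
    )
    # Assumed reading of tg(alpha): rise over run across the whole window.
    tg_alpha = (rel[-1] - rel[0]) / (times_s[-1] - times_s[0])
    return {"r_max": r_max, "auc": auc, "tg_alpha": tg_alpha}


def feature_vector(times_s, curves, baselines):
    """Concatenate the features of all sensors (32 sensors give 96 values)."""
    vector = []
    for resistance, baseline in zip(curves, baselines):
        f = curve_features(times_s, resistance, baseline)
        vector.extend([f["r_max"], f["auc"], f["tg_alpha"]])
    return vector


# Hypothetical three-point curve, for illustration only.
example = curve_features([0, 30, 60], [100.0, 110.0, 120.0], baseline=100.0)
print(example)
```

With three features per sensor and 32 sensors, such a layout yields 96 continuous predictors per breath sample, to which age, smoking status, smoking history and ambient temperature would be appended.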
LRA calculates the probability of each observation falling into one or the other group, and seeks to maximize the likelihood and utility of the decision. It takes into account the independent (predictor) and dependent (outcome) variables (figure 5); the key point is that the outcome can take only one of two values (e.g., black/white), and the model predicts the probability of each observation falling into its respective group [9].
In other words, LRA seeks the model under which the observed group memberships are most probable. The method can be combined with additional methods, such as kernel or Bayesian models [10].
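The idea of fitting a binary-outcome model by maximizing the likelihood can be illustrated with a small, self-contained sketch. It is not the Statistica procedure used in the study: the fitting routine is plain gradient ascent on the log-likelihood, the data are synthetic two-feature points generated only for illustration, and all names (`fit`, `predict_proba`, learning rate, number of epochs) are assumptions.

```python
import math
import random


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(weights, bias, x):
    """Modelled probability that observation x belongs to group 1."""
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))


def log_likelihood(weights, bias, X, y):
    """Log-likelihood of the observed labels under the model."""
    eps = 1e-12
    total = 0.0
    for x, label in zip(X, y):
        p = predict_proba(weights, bias, x)
        total += label * math.log(p + eps) + (1 - label) * math.log(1 - p + eps)
    return total


def fit(X, y, lr=0.1, epochs=500):
    """Gradient ascent on the mean log-likelihood (illustrative settings)."""
    n, d = len(X), len(X[0])
    weights, bias = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for x, label in zip(X, y):
            err = label - predict_proba(weights, bias, x)
            grad_b += err
            for j in range(d):
                grad_w[j] += err * x[j]
        bias += lr * grad_b / n
        weights = [w + lr * g / n for w, g in zip(weights, grad_w)]
    return weights, bias


# Synthetic data: two overlapping clusters standing in for two groups.
rng = random.Random(0)
X = [[rng.gauss(-1, 1), rng.gauss(-1, 1)] for _ in range(100)]
X += [[rng.gauss(1, 1), rng.gauss(1, 1)] for _ in range(100)]
y = [0] * 100 + [1] * 100

weights, bias = fit(X, y)
predicted = [1 if predict_proba(weights, bias, x) >= 0.5 else 0 for x in X]
accuracy = sum(p == t for p, t in zip(predicted, y)) / len(y)
print(weights, bias, accuracy)
```

Each observation receives a probability between 0 and 1, and a threshold (0.5 here, chosen arbitrarily) turns that probability into a two-group decision.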
When analysing the results, we calculated:
• sensitivity: number of true positives/(number of true positives + number of false negatives);
• specificity: number of true negatives/(number of true negatives + number of false positives);
• positive predictive value (PPV): number of true positives/(number of true positives + number of false positives);
• negative predictive value (NPV): number of true negatives/(number of true negatives + number of false negatives) [7,11-14].

Data models for LRA
We developed two data models. The first model

Results
The total number of 'cancer' patients was 252 and the total number of 'no-cancer' patients was 223.
In the test sample of the next 100 investigated patients (after those whose results we analysed and showed above), 82 cases were predicted correctly.
For descriptive statistics of comparison between 'cancer' and 'no-cancer' patients regarding smoking exposure in pack-years, see tables 3 and 4.
The Mann-Whitney U test showed a significant difference between the 'cancer' and 'no-cancer' groups regarding their smoking exposure in pack-years (p<0.0001) (table 5), while the t-test found no significant difference in patients' height and weight (table 5).
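For readers unfamiliar with the Mann-Whitney U test, the sketch below shows how the statistic is obtained from ranks. It is a simplified illustration, not the procedure of the statistical package used in the study: the p-value uses the normal approximation without tie or continuity correction, and the pack-year values are hypothetical.

```python
import math


def midranks(values):
    """Ranks starting at 1; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def mann_whitney_u(a, b):
    """Return (U_a, U_b, two-sided p) using the normal approximation."""
    n1, n2 = len(a), len(b)
    ranks = midranks(list(a) + list(b))
    rank_sum_a = sum(ranks[:n1])
    u_a = rank_sum_a - n1 * (n1 + 1) / 2
    u_b = n1 * n2 - u_a
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)  # no tie correction
    z = (min(u_a, u_b) - mean_u) / sd_u
    p_two_sided = math.erfc(abs(z) / math.sqrt(2))
    return u_a, u_b, p_two_sided


# Hypothetical pack-year values, for illustration only.
pack_years_a = [40, 35, 50, 28, 45, 60]
pack_years_b = [5, 0, 12, 20, 8, 15]
u_a, u_b, p = mann_whitney_u(pack_years_a, pack_years_b)
print(u_a, u_b, p)
```

Because the test works on ranks rather than raw values, it suits skewed quantities such as pack-years, which is presumably why it was preferred over the t-test for that variable.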

Discussion
We consider that we achieved very good sensitivity (95.8% in the 'smoker' and 96.2% in the 'non-smoker' group) and specificity (92.3% in the 'smoker' and 90.6% in the 'non-smoker' group). Both the overall sensitivity and the overall specificity of the LRA were only slightly affected by whether the patients were 'non-smokers' (non-smokers and ex-smokers) or smokers. Both measures in both groups, as well as PPV and NPV, were all above 90%. This means that by using LRA we could correctly classify patients as having or not having lung cancer with more than 90% accuracy.
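The four measures discussed here all follow from a 2×2 table of true and false positives and negatives. The sketch below computes them; the counts are hypothetical and chosen only to make the arithmetic easy to follow, they are not the study's data.

```python
def diagnostic_metrics(tp, fn, tn, fp):
    """Sensitivity, specificity, PPV and NPV from a 2x2 confusion table."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
    }


# Hypothetical counts: 100 diseased and 100 non-diseased individuals.
metrics = diagnostic_metrics(tp=90, fn=10, tn=80, fp=20)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Note that sensitivity and specificity depend only on the classifier's behaviour within each group, whereas PPV and NPV also depend on the proportion of cancer patients in the sample, so they would differ in a screening population with a lower prevalence.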
Only a few studies in respiratory medicine have used LRA to analyse data obtained with an electronic nose, even when the search is not limited to studies devoted to the detection of lung cancer with e-nose and LRA. Thaler and colleagues used an electronic nose to analyse patients' samples, trying to distinguish biofilm-producing from non-biofilm-producing Pseudomonas aeruginosa and Staphylococcus aureus strains (two species each). Binary classification of the bacteria (biofilm-producing versus non-biofilm-producing) by logistic regression was performed with various sets of data (e.g., taking into account the days of illness). The testing accuracy varied between 80.6% and 100% for both Pseudomonas aeruginosa species and between 72.2% and 91.7% for both Staphylococcus species [15].
Sibila and colleagues studied the use of LRA to estimate the probability of airway bacterial colonization in patients with COPD. The air over cultures of specimens from clinically stable COPD patients and healthy controls was sampled using an electronic nose and afterwards analysed with LRA. The accuracy was 88% for colonized and 83% for non-colonized patients [16].
The use of an electronic nose together with logistic regression analysis in detecting head and neck squamous cell carcinoma was studied by Leunis and colleagues. The study was rather small (23 patients). The sensitivity achieved with the chosen method was 90% and the specificity 80% [17].
A different type of electronic nose sensor technology was used in another study by Thaler and colleagues, who sought a reliable difference between groups of patients with and without chronic bacterial sinusitis. Medical studies usually use a metal oxide sensor e-nose, as we did in our study, but in this study colorimetric sensor arrays were chosen. Nevertheless, the data obtained were later analysed with logistic regression, and the rate of accurate classification was 90% [18].
Schnabel et al used LRA to try to detect ventilator-associated pneumonia (VAP) and analysed samples from 72 patients. The sensitivity was 88% with a specificity of 66% when the method was used to distinguish VAP patients from the control group, but the values fell to a sensitivity of 76% with a specificity of 56% when it was used to distinguish between patients with clinical signs of VAP. The final diagnosis was adjusted even further after receiving the results of the bronchoalveolar lavage [19].
We performed a PubMed database search on 20 August 2017, searching for studies where the electronic nose was used together with LRA in the detection of lung cancer, but could not find any relevant study.
As we have already mentioned, e-nose can be and in some studies has been used together with various methods of statistical analysis, yet until now there is no superior recommended method to be used when analysing data obtained with it. Various statistical methods have been used in analysing data obtained via electronic nose and aimed at detection of lung cancer, all of which show good results. One of the first publications regarding detection of lung cancer with e-nose was published in 2005 by Machado et al. They conducted a study where they used a support vector machine and obtained 91.9% specificity with 66.6% positive predictive value and 93.4% negative predictive value regarding the detection of lung cancer [20].
Another study was published in 2009 by Dragonieri et al, who used canonical discriminant analysis. Their cross-validation results were 85% correct when distinguishing lung cancer from COPD, and 90% correct when distinguishing lung cancer from healthy controls [21].
When using LRA, there is an option to choose the variables and to explore their effect on the results. We included ambient temperature as a predictive factor because, when we were gathering patients' breath samples during the summer and the indoor temperature exceeded 36°C, we noticed unexpected changes in data registration, and those data could not be included in the study. We therefore believe that if the ambient temperature starts to approach the e-nose's internal temperature, the e-nose and its sensors could fail.
Our results show a high success rate in detecting lung cancer, but we have to stress once again that there are practically no data on the use of the electronic nose with LRA against which to compare our results, because the majority of authors have chosen other data analysis methods. As we were focusing on the detection of lung cancer, we did not analyse patients' data regarding the link of other diseases with a specific breathprint pattern. The number of patients in each 'no-cancer' disease group was too small to analyse them meaningfully, and we did not do so, as we aimed at the analysis of lung cancer specifically. Likewise, we did not analyse differences between cancer stages or histological types. We assume that this would be a worthwhile and valuable further investigation, even though in one of the studies where histological type was taken into account no differences were observed [22].
We admit that we have not taken into account the possible influence of medications taken by the study subjects. The medications used by the study subjects were varied, and most subjects used not one medication but a combination of medications from different pharmaceutical groups. The number of patients using each separate medication would therefore be quite small. Since each pharmaceutical group could have a different impact on the results, the patients would have to be divided into appropriate groups, and a proper analysis would not be possible. There were no patients in the 'cancer' group who had end-stage disease and were severely ill. The healthy volunteers were not taking any medications. We did not ask the patients about their use of over-the-counter medications. In further studies with larger patient groups it might be possible to analyse the influence of the medications used.
Likewise, additional attention and further recommendations are needed regarding the optimal exhalation flow rate. The optimal expiratory flow rate has been discussed since, but when we started our study and set the methodology there were no data available on it. As was established by Bikov et al, the expiratory flow rate influenced the results of healthy individuals, but not the results of lung cancer patients [22].
A thorough analysis of studies that have used e-nose as their diagnostic tool has been carried out recently to explore the impact and differences arising from different data validation and analysis systems [23], covering 46 prior studies. The authors found that no classification or analysis method has resulted in consistent results of the analysed training set, even after internal validation, and proposed to use external validation as well. This comparative explorative study was not carried out at the time when we were working with our patients and the number of the completed and published e-nose studies was much smaller then than it is now. Thus, we have not performed such a thorough analysis with both wide internal and external validation groups ourselves, but we strongly agree that this should be encouraged in further studies.
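The distinction between internal and external validation can be sketched in terms of how the sample indices are partitioned. The code below is only an illustration with hypothetical names and sizes: k-fold cross-validation stands in for internal validation, and a set-aside portion stands in for an external set. A true external validation would use an independently collected cohort rather than a random split of the same sample.

```python
import random


def holdout_split(n_samples, external_fraction, seed=0):
    """Set aside indices that are never used for fitting or tuning."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_external = int(round(n_samples * external_fraction))
    return indices[n_external:], indices[:n_external]


def k_fold(indices, k):
    """Yield (train, validation) index lists for k-fold cross-validation."""
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation


# Hypothetical sample of 100 subjects: 80 for development, 20 set aside.
development, external = holdout_split(n_samples=100, external_fraction=0.2)
splits = list(k_fold(development, k=5))
print(len(development), len(external), [len(v) for _, v in splits])
```

In such a scheme the model is fitted and tuned only on the development folds, and the set-aside portion is scored once at the end, which gives a less optimistic estimate than performance on the training set.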
LRA is a possible tool to be used, and in our view an optimal one, given that in LRA one can choose to include or exclude the factors that essentially affect the results. Those might be factors that depend on the sensor detector, chemical substances, etc, or a combination of two or more factors that can later be avoided and the analysis formula altered to take them into account. Likewise, there might be different approaches to the analysis of each factor, e.g., relative maximum, area under the curve, tgα or other. The exhalation rate might have an impact as well. An advantage of LRA is that it is a kind of self-learning tool that can be improved with time, use and the problems encountered. With every subsequent patient the model learns, adapts and calculates the optimal variant for analysing the incoming data. Thus detector changes over time and other similar problems might be avoided with the use of a self-learning model.
Although studies on the use of the e-nose in the detection of lung cancer continue and are delivering more and more promising results, some challenges remain [24,25]. These are linked both with the e-nose device and sensor properties, such as humidity, temperature, sensor stability, etc, as discussed above, and with the study design and patient selection, as well as with the choice of statistical methods. We look forward to more recommendations regarding the optimal statistical analysis system for data obtained with the electronic nose from patients' exhaled breath samples in order to detect lung cancer.

Conclusions
LRA is a good tool, with high specificity and sensitivity, for use in the detection of lung cancer. The results obtained by means of LRA vary slightly depending on the chosen variables. The results of exhaled breath analysis depend on the chosen statistical method.