Infrared cavity ring-down spectroscopy for detecting non-small cell lung cancer in exhaled breath

Early diagnosis of lung cancer greatly improves the likelihood of survival and remission, but limitations in existing technologies like low-dose computed tomography have prevented the implementation of widespread screening programs. Breath-based solutions that seek disease biomarkers in exhaled volatile organic compound (VOC) profiles show promise as affordable, accessible and non-invasive alternatives to traditional imaging. In this pilot work, we present a lung cancer detection framework using cavity ring-down spectroscopy (CRDS), an effective and practical laser absorption spectroscopy technique that has the potential to advance breath screening into clinical reality. The main aims of this work were to (1) test the utility of infrared CRDS breath profiles for discriminating non-small cell lung cancer (NSCLC) patients from controls, (2) compare models with VOCs as predictors to those with patterns from the CRDS spectra (breathprints) as predictors, and (3) present a robust approach for identifying relevant disease biomarkers. First, based on a proposed learning curve technique that estimated the limits of a model's performance at multiple sample sizes (10–158), the CRDS-based models developed in this work were found to achieve classification performance comparable or superior to that of similar mass spectrometry and sensor-based systems. Second, using 158 collected samples (62 NSCLC subjects and 96 controls), the accuracy range for the VOC-based model was 65.19%–85.44% (51.61%–66.13% sensitivity and 73.96%–97.92% specificity), depending on the employed cross-validation technique. The model based on breathprint predictors generally performed better, with accuracy ranging from 71.52%–86.08% (58.06%–82.26% sensitivity and 80.21%–88.54% specificity). Lastly, using a protocol based on consensus feature selection, three VOCs (isopropanol, dimethyl sulfide, and butyric acid) and two breathprint features (from a local binary pattern transformation of the spectra) were identified as possible NSCLC biomarkers. This research demonstrates the potential of infrared CRDS breath profiles and the developed early-stage classification techniques for lung cancer biomarker detection and screening.


Introduction
As the deadliest and most prevalent form of cancer worldwide [1], lung cancer requires urgent, effective intervention. Unfortunately, the disease is most often discovered in its later stages when treatment options are limited and prognoses are poor. For instance, the Surveillance, Epidemiology, and End Results (SEER) Cancer Statistics Review reported that five-year survival rates drop from 57.4% for localized cases to 30.8% for cases with regional nodal involvement, and further to 5.2% for cases with distant metastases [2]. Early lung cancer detection is therefore crucial, yet widespread screening programs remain elusive.
While some health organizations recommend low-dose computed tomography (also called a low-dose CT scan, or LDCT) for this purpose, the technology's costs and tendency for overdiagnosis are significant hindrances [3]. This unmet need for an affordable and practical screening method has prompted a wave of research into breath-based solutions predicated on volatile organic compounds (VOCs). An exhaled breath sample may consist of thousands of VOCs, produced either exogenously or endogenously. Those of endogenous origin offer insight into the metabolic processes in the body, thus enabling the detection of an anomaly such as lung cancer through non-invasive breath sampling [4].
Once collected, breath samples are most commonly analyzed by a form of mass spectrometry, most often gas chromatography-mass spectrometry (GC-MS) [5]. GC-MS is popular because it offers highly accurate VOC identification and quantification in a sample, enabling the characterization of lung cancer by the presence, increase, or decrease of particular VOCs in the breath. However, there is very little agreement across studies of this type in the VOCs found to be lung cancer biomarkers [6,7]. A systematic literature review by Saalberg and Wolff [7] listed 77 different VOCs identified as potential biomarkers for lung cancer from 52 research articles (from 1985 to 2015), but the most frequently confirmed biomarkers were each found only five times (i.e. 10% of the studies). The inconsistencies and contradictions in identified biomarkers can be attributed, at least in part, to the absence of standardized methods for sampling, detection, and statistical analysis as well as the abundance of confounding factors in breath analysis. Among others, significant confounders include environmental conditions at the time of collection, age, diet, smoking habits, gender, and comorbidities [6]. Further, technologies like GC-MS require time-consuming, complex sample preparation and operation by experts, prohibiting online applications and limiting their effective use to laboratory research [8,9].
Motivated by the cost effectiveness, portability, and speed offered by sensor-based technologies, another approach to breath analysis using electronic noses (e-noses) has recently emerged [10,11]. A wide assortment of sensor types has been adopted for lung cancer breath research, including acoustic, quartz micro-balance, and metal oxide semi-conducting sensors [12,13]. Because they are unable to provide specific VOC concentrations, systems using e-nose technology rely on patterns in the sensor array's 'breathprint' for detecting disease. Therefore, unlike the traditional VOC extraction approach to breath analysis, machine learning techniques are often used to extract meaningful features directly from the sensor response. To date, e-noses have had promising results in capturing underlying disease signatures with this approach [10,11]. Despite their advantages, e-nose sensors tend to be limited by low sensitivity, frequent calibration requirements, susceptibility to environmental factors like humidity and temperature, drifting, and memory effects [12,14].
Though comparatively less common in breath research, laser absorption spectroscopy (LAS) is an attractive alternative to MS and e-nose technologies. In recent years, improvements in analyzer hardware and laser sources have advanced LAS techniques to a degree comparable to even GC-MS in sensitive, effective breath profiling [14], while maintaining low costs, quick analysis times and the ability to be operated by non-experts [15]. Like e-noses, these optical techniques can often be adapted for online analysis, eliminating the need for sample storage. However, LAS techniques also generally outperform e-noses in terms of sensitivity and robustness [15]. For all of these reasons, LAS breath tests are well-suited for the real-world conditions encountered in routine clinical practice, and there are numerous commercially available, US Food and Drug Administration (FDA)-approved breath tests that use portable optical technologies [15].
In this work, non-small cell lung cancer (NSCLC) detection systems were developed using mid-infrared breath profiles obtained through cavity ring-down spectroscopy (CRDS), an ultra-sensitive form of LAS. CRDS is based on the decay of a laser pulse as it bounces back and forth between two highly reflective mirrors in a cylinder enclosing the sample of interest. The decay rate (or ring-down) of the pulse reveals the sample's light absorption. CRDS is especially practical and robust for routine clinical use as it is essentially calibration-free and, because the measurements are time-based, it is immune to fluctuations in laser intensity [16]. Further, due to the increased effective path length of the light in the cavity, CRDS affords even higher sensitivity than other forms of LAS [17]. Hence, the first purpose of this study was to evaluate the potential of CRDS for detecting NSCLC through statistical and machine learning analysis of exhaled breath samples. The second purpose of the study was to compare the detection capability of models based on VOC predictors and models based on spectral breathprint predictors. Lastly, based on the lack of consensus found in biomarkers across studies in the literature, the third purpose of this study was to propose a robust method for identifying potential NSCLC breath biomarkers.

Data collection
One hundred biopsy-confirmed lung cancer patients and 98 non-cancer control subjects were enrolled in the study to provide breath samples. The control participants had no known active cancer or history of lung cancer, and consisted of clinical staff and friends or family of the lung cancer patients who accompanied them on the day of their collection. All subjects provided informed consent as per the Horizon Health Network's Research Ethics Board (#100099), and analyses of the collected data were conducted as approved by the University of New Brunswick's Research Ethics Board (#2019-068). Collection was performed at three hospitals using an exhaled breath sampler developed by Breathe Bio-Medical [18], which tracks CO₂ levels to collect alveolar breath into Tenax TA sorbent tubes. Before their shipment to sample collection sites, all sorbent tubes were conditioned and batch tested to ensure low background levels. To minimize confounding effects of environmental VOCs, the lung cancer and control subjects performed the collection in the same rooms (one at each collection site). Subjects were also asked to abstain from smoking for 4 h and drinking alcohol for 8 h prior to collection, during which they were instructed to breathe deeply and exhale into a single-use filter on the sampler's mouthpiece until a 10-litre (10 l) sample was amassed (approximately 30 min). Apart from shipment periods, collected samples were maintained at −20 °C until analysis.
Post-collection, the inclusion criteria for lung cancer subjects were amended to exclude eleven otherwise eligible patients with ambiguous or small-cell histologic subtypes. Given the limited representation in the dataset for these subtypes, the analysis was focused on NSCLC patients, who constitute approximately 85% of the lung cancer population [2]. After also excluding subjects with missing data (for example, those unable to provide the full 10 l sample) and patients who had undergone any form of lung cancer treatment, the remaining 62 pretreatment NSCLC patients and 96 non-cancer control subjects were included in the analysis.
A comparison of demographics and clinical factors for the two cohorts is provided in table 1. These include age, sex, COPD assessment test (CAT) score (describing the extent of a subject's breathing symptoms, 0 at best and 40 at worst [19]), smoking habits, presence of other lung conditions, number of hours since last food intake, and histologic subtype for the NSCLC patients. Data for these factors were available for all subjects, aside from smoking habits, which were known for 152 of 158 subjects. The presented group proportions are corrected for this missing data.

CRDS measurements
Infrared breath profiles were measured by Breathe BioMedical for each of the 10 l samples using CRDS. Two narrow linewidth (100 kHz) CO₂ lasers with carbon isotopes ¹²C and ¹³C were tuned to a combined 73 lines in the mid-infrared region, a favourable spectral range for the detection of small molecules [8]. Mirrors of reflectivity ≥99.8% increased the effective path length of light in the cavity to approximately one kilometer, attaining ring-down times of roughly one to two microseconds in the region of interest (9.2–11.3 µm). At each wavelength, the average times from 500 ring-downs were measured for the breath sample ($\tau$) and for a baseline nitrogen sample ($\tau_0$). The absorption coefficients $K$ comprising each spectrum were calculated from the average ring-down times according to (1), where $c$ is the speed of light:

$$K = \frac{1}{c}\left(\frac{1}{\tau} - \frac{1}{\tau_0}\right) \qquad (1)$$

The analysis was performed four times for each sample, at desorption temperatures of 75 °C, 150 °C, 225 °C, and 300 °C, yielding four different spectra per subject. Certain volatiles require higher temperatures than others to be fully released from the Tenax TA adsorbent material, so the four spectra characterize different assortments of VOCs. The multi-desorb procedure ensures that (1) high-absorption VOCs with low boiling points (such as alcohols) do not overwhelm other potentially important VOCs, and (2) ring-down times are not below the detector's sensitivity, which can occur if absorption is too high. Figure 1 depicts the four spectra obtained for one random subject.
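As an illustration of the ring-down calculation, a minimal Python sketch is given below. It assumes the standard CRDS relation K = (1/c)(1/τ − 1/τ₀); the function names and the example ring-down times are hypothetical and are not taken from the instrument software used in this study.

```python
# Sketch of the absorption coefficient calculation from averaged ring-down times.
# Assumption: K = (1/c) * (1/tau - 1/tau0), with c in cm/s so that K is in cm^-1.

C_CM_PER_S = 2.99792458e10  # speed of light in cm/s


def mean(values):
    """Arithmetic mean of a sequence of ring-down times."""
    return sum(values) / len(values)


def absorption_coefficient(tau, tau0, c=C_CM_PER_S):
    """Absorption coefficient from sample (tau) and baseline (tau0) ring-down times in seconds."""
    if tau <= 0 or tau0 <= 0:
        raise ValueError("ring-down times must be positive")
    return (1.0 / tau - 1.0 / tau0) / c


# Hypothetical ring-down times (seconds) at a single laser line.
# The study averaged 500 ring-downs; three are used here for brevity.
tau0_shots = [2.00e-6, 2.02e-6, 1.98e-6]  # nitrogen baseline
tau_shots = [1.50e-6, 1.52e-6, 1.48e-6]   # breath sample

K = absorption_coefficient(mean(tau_shots), mean(tau0_shots))
```

A shorter ring-down time for the sample than for the baseline gives a positive K, and equal ring-down times give K = 0. Repeating the calculation at each of the 73 laser lines would yield one spectrum.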

VOC concentrations
A stepwise linear fitting algorithm [20] was used for quantitative estimation of the concentrations of the compounds present in each CRDS spectrum. Using a reference library of 152 absorption cross-sections from the Quantitative IR Database provided by Pacific Northwest National Laboratory [21,22] and the HITRAN database [23], compounds were iteratively added or removed from the model based on the Akaike Information Criterion (AIC) for each potential fit given by (2). RSS represents the sum of squares of the fit residual, $N_y$ is the number of measurements in the spectrum, $N_x$ is the number of compounds in the fit, and $\mathrm{AIC}_k$ is a parameter used to control the balance between the goodness of fit and entropy, fixed at 2 for this study:

$$\mathrm{AIC} = N_y \ln\left(\frac{\mathrm{RSS}}{N_y}\right) + \mathrm{AIC}_k N_x \qquad (2)$$

The stepwise procedure was stopped when the AIC could not be further improved by adding or removing a compound from the fit. To create the final VOC-based feature matrix for classification, the VOC concentrations (in ppbV) resulting from the fitting algorithm were log-transformed.
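The stepwise logic can be sketched as follows. This is an illustrative outline only: it assumes the least-squares form of the criterion, AIC = N_y ln(RSS/N_y) + AIC_k N_x, and it replaces the actual linear fit with a hypothetical `toy_rss` function, so the compound names and residual values are invented for the example.

```python
import math


def aic(rss, n_y, n_x, aic_k=2.0):
    """Least-squares AIC (assumed form); rss must be positive."""
    return n_y * math.log(rss / n_y) + aic_k * n_x


def stepwise_fit(library, fit_rss, n_y, aic_k=2.0):
    """Add or remove one compound at a time until the AIC stops improving.

    fit_rss(compounds) must return the residual sum of squares of a fit
    that uses exactly the given set of compounds.
    """
    selected = frozenset()
    best = aic(fit_rss(selected), n_y, 0, aic_k)
    while True:
        # Candidate moves: add any absent compound or remove any present one.
        moves = [selected | {c} for c in library if c not in selected]
        moves += [selected - {c} for c in selected]
        if not moves:
            break
        scores = [aic(fit_rss(m), n_y, len(m), aic_k) for m in moves]
        best_index = scores.index(min(scores))
        if scores[best_index] >= best:
            break  # no single addition or removal improves the AIC
        selected, best = moves[best_index], scores[best_index]
    return set(selected), best


# Hypothetical stand-in for the spectral fit: each compound scales the
# residual by a fixed factor (smaller factor = more variance explained).
FACTORS = {"acetone": 0.1, "ethanol": 0.5, "isoprene": 0.99}


def toy_rss(compounds):
    rss = 100.0
    for name in compounds:
        rss *= FACTORS[name]
    return rss


chosen, final_aic = stepwise_fit(list(FACTORS), toy_rss, n_y=73)
```

In this toy case the third compound reduces the residual by only 1%, which does not offset the penalty of 2 per added compound, so it is left out of the final fit.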

Spectral breathprints
The spectral breathprint features were derived directly from the CRDS absorption profiles through pre-processing and feature extraction methods. More details about spectral data pre-processing can be found in our preliminary study [24]. Due to the high variability in breathprints observed across subjects, in this study, the crafted features emphasized the shape of the spectra rather than individual absorption values. Specifically, the first order and second order spectral derivatives were extracted to highlight peaks and troughs in the spectra. Along with the raw spectra, these derivatives were then transformed using a one-dimensional local binary pattern (1D-LBP) feature extraction technique [25]. Briefly, a nine-point moving window was used to create a series of eight-bit binary LBP codes for each input spectrum or derivative, where ones represent points in the window that are greater than its center value and zeros represent points that are smaller. In this way, the LBP codes describe the structure of the input by capturing relationships between neighboring points. These LBP codes were further transformed by counting the frequencies of the 256 possible patterns to produce a set of histogram features. In this study, only the histogram features corresponding to the 58 'uniform' eight-bit patterns, those which contain at most two bitwise transitions from zero to one or the reverse, were included in the final feature matrix. These uniform patterns are less likely to capture random processes than the more complex, less frequent non-uniform patterns, and have been shown to contain the most discriminative information [26]. For the remainder of the paper, breathprint features are denoted by the spectrum type (raw or derivative: 1st or 2nd) and the 8-bit LBP code in its decimal representation.
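A minimal sketch of the 1D-LBP histogram features is shown below. Several details are assumptions made for illustration rather than the exact implementation used here: the bit ordering (leftmost neighbour as the most significant bit), the handling of ties (a neighbour equal to the center is coded as zero), and counting transitions circularly, which is the convention that yields 58 uniform eight-bit patterns.

```python
def lbp_codes(signal, half_window=4):
    """Eight-bit 1D-LBP codes from a nine-point moving window.

    Each of the eight neighbours contributes a 1 if it is greater than the
    center value and a 0 otherwise (assumed tie rule: equal -> 0).
    """
    signal = list(signal)
    codes = []
    for i in range(half_window, len(signal) - half_window):
        center = signal[i]
        neighbours = signal[i - half_window:i] + signal[i + 1:i + half_window + 1]
        code = 0
        for value in neighbours:
            code = (code << 1) | (1 if value > center else 0)
        codes.append(code)
    return codes


def is_uniform(code, bits=8):
    """True if the circular bit pattern has at most two 0/1 transitions."""
    mask = (1 << bits) - 1
    rotated = ((code << 1) | (code >> (bits - 1))) & mask
    return bin(code ^ rotated).count("1") <= 2


UNIFORM_CODES = [c for c in range(256) if is_uniform(c)]


def uniform_histogram(signal):
    """Frequency of each uniform LBP code in the signal."""
    codes = lbp_codes(signal)
    return {c: codes.count(c) for c in UNIFORM_CODES}


def first_difference(signal):
    """Simple finite-difference derivative (an assumed stand-in for the derivative step)."""
    signal = list(signal)
    return [b - a for a, b in zip(signal, signal[1:])]


# Hypothetical 73-point spectrum that increases monotonically.
spectrum = list(range(73))
hist_raw = uniform_histogram(spectrum)
hist_d1 = uniform_histogram(first_difference(spectrum))
```

For a 73-point spectrum the nine-point window produces 65 codes. In the monotonic example every window has four smaller points to the left and four larger points to the right, so every code is 00001111 (decimal 15).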

Classification models
For both VOC- and spectral breathprint-based feature matrices, the same feature selection and classification techniques were used in developing the models. Following feature extraction, feature selection is a critical step for identifying potentially useful predictors among many irrelevant and correlated features that can hinder performance. The feature selection in this work comprised two steps: (1) ranking features using the minimum redundancy maximum relevance (mRMR) algorithm [27], and (2) optimizing the number of top-ranked features to include in the model using classification performance. With mRMR, features are sequentially selected based on their correspondence to the class membership labels and distinction from other features. Specifically, in this work, the ranking criterion was a feature's Pearson correlation with the class labels minus its average Pearson correlation with more highly ranked features. Once ranked, the top features were added one at a time to a candidate feature set that was evaluated through training and validation of a classification algorithm using leave-one-out cross-validation (LOOCV). To limit model complexity, only feature sets with fewer than half the number of samples were assessed (a 2:1 ratio of samples to features). Among the candidate feature sets, the optimal set for the given samples was that which provided the best balance between sensitivity (true positive rate) and specificity (true negative rate). This was based on the distance $D$ from the optimal (0,1) corner of a receiver operating characteristic (ROC) curve [28], given by (3):

$$D = \sqrt{(1 - \mathrm{sensitivity})^2 + (1 - \mathrm{specificity})^2} \qquad (3)$$

Both intermediate and culminating classification models (those used for tuning the number of features and those used for finding final performance estimates, respectively) used a linear support vector machine (SVM) algorithm for learning class associations. Using features from a labeled set of training samples, the SVM constructs a hyperplane in the feature space to best separate the two classes. The algorithm attempts to maximize the margin around the hyperplane (i.e. its distance from samples in either class), which then acts as a decision boundary for classifying future samples. The margin can be adjusted by tuning a penalty factor for misclassifications, c. As preliminary studies indicated that a value of c = 1 performed well for both the VOC and spectral breathprint features, this parameter was fixed at 1 for all models to maintain low computation times and avoid over-tuning.
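The feature ranking and the ROC-distance criterion described above can be sketched as follows. The sketch makes two assumptions that are not stated in the text: absolute values of the Pearson correlations are used for both relevance and redundancy, and the distance is taken as D = sqrt((1 − sensitivity)² + (1 − specificity)²). The toy features are hypothetical, and the SVM training step is omitted.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def mrmr_rank(features, labels):
    """Rank features by relevance minus mean redundancy with already-ranked features.

    features: dict mapping a feature name to its list of values.
    Assumption: absolute Pearson correlations are used throughout.
    """
    relevance = {name: abs(pearson(v, labels)) for name, v in features.items()}
    ranked = []
    remaining = set(features)
    while remaining:
        def score(name):
            if not ranked:
                return relevance[name]
            redundancy = sum(
                abs(pearson(features[name], features[r])) for r in ranked
            ) / len(ranked)
            return relevance[name] - redundancy

        best = max(sorted(remaining), key=score)
        ranked.append(best)
        remaining.remove(best)
    return ranked


def roc_distance(sensitivity, specificity):
    """Distance from the ideal (0, 1) corner of the ROC plane; smaller is better."""
    return math.sqrt((1 - sensitivity) ** 2 + (1 - specificity) ** 2)


# Hypothetical toy data: 'a_copy' duplicates 'a', 'b' carries weaker but new information.
labels = [0, 0, 0, 1, 1, 1]
features = {
    "a": [0.1, 0.2, 0.1, 0.9, 1.0, 0.8],
    "a_copy": [0.2, 0.4, 0.2, 1.8, 2.0, 1.6],
    "b": [0, 1, 0, 1, 0, 1],
}
ranking = mrmr_rank(features, labels)
```

In the toy data, the duplicate of the top feature is ranked last despite its high relevance, because its redundancy with the first-ranked feature cancels that relevance. The candidate feature set with the smallest `roc_distance` under LOOCV would then be retained.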

Performance evaluation
To provide a contextualized look at each model's classification ability and enable comparison to previous studies that used smaller sample sizes, a learning curve approach was used for performance evaluation. A learning curve depicts a model's classification performance at several incremental sample sizes, as the model becomes increasingly able to learn the relevant patterns from the features. Starting with a subset of ten subjects, random progressive sampling was used to create a series of datasets ranging up to the maximum sample size (158 subjects) in increments of ten. Equal class sizes were maintained in each subset until sample sizes exceeded 124, after which only control subjects remained to be added. For each sample size, the models were redeveloped and validated using the techniques in section 2.5 to obtain a series of empirical performance estimates. This was repeated over ten iterations for both the VOC- and breathprint-based models and the learning curves were averaged.
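One way to generate the progressive subsets for a single learning curve iteration is sketched below. The subject identifiers, the function names, and the use of a seeded shuffle are hypothetical choices for illustration; only the sample sizes and the class-balancing rule follow the description above.

```python
import random


def learning_curve_sizes(n_total=158, start=10, step=10):
    """Sample sizes start, start + step, ... capped by the full dataset size."""
    sizes = list(range(start, n_total, step))
    sizes.append(n_total)
    return sizes


def progressive_subsets(case_ids, control_ids, sizes, seed=0):
    """Nested subsets with equal class sizes for as long as cases remain.

    Each larger subset contains all subjects of the smaller ones
    (progressive sampling). Once the cases are exhausted, only controls
    are added.
    """
    rng = random.Random(seed)
    cases, controls = list(case_ids), list(control_ids)
    rng.shuffle(cases)
    rng.shuffle(controls)
    subsets = []
    for n in sizes:
        n_cases = min(n // 2, len(cases))
        n_controls = n - n_cases
        subsets.append((cases[:n_cases], controls[:n_controls]))
    return subsets


# Hypothetical subject identifiers: 62 NSCLC patients and 96 controls.
case_ids = ["case_%02d" % i for i in range(62)]
control_ids = ["ctrl_%02d" % i for i in range(96)]

sizes = learning_curve_sizes()
subsets = progressive_subsets(case_ids, control_ids, sizes, seed=1)
```

Repeating the call with ten different seeds would give the ten iterations whose learning curves are averaged.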
Because models trained with small sample sizes can be prone to overfitting, where noise in the data is learned in addition to or rather than relevant patterns, two different approaches were used for obtaining the performance estimates: (1) non-nested LOOCV and (2) nested LOOCV. Both frameworks are based on cross-validation (CV), a data-efficient resampling method for evaluating a classification model's performance. For the CV procedure, available samples are randomly partitioned into k sections, or 'folds', of approximately equal size. The samples from the first fold are assigned to a testing group and the samples from all remaining folds are assigned to a training group. The training samples are used for learning classification parameters, and the test samples are used to evaluate the trained classifier. In turn, each fold is used once as the test set, and a final performance estimate is found by averaging the classifier's performance over all k folds. LOOCV is a special case of CV in which each fold consists of a single sample (k is equal to the number of samples).
Figure 2 illustrates how the non-nested and nested LOOCV frameworks incorporate model development steps like feature ranking and feature number optimization into the CV procedure. With non-nested LOOCV, the model development is performed once using all samples, resulting in a single optimal feature set that is fixed during model validation. Although the classification parameters change for each CV fold, the selected features do not. In contrast, with the nested LOOCV framework, the model development steps are performed using only designated training samples. This means that model development is repeated for each training set during the CV procedure, resulting in multiple feature sets.
Though neither framework is ideal alone, non-nested and nested CV provide estimates of the upper and lower bounds of a model's performance [29]. Overfitting is a problem with small sample sizes because strong noise-based differences between groups may appear by chance, and often perfect class separation may be achieved by exploiting these spurious patterns. High variance, overly complex models that do not generalize well to new samples are therefore a concern. With the non-nested approach, because model development steps include information from test subjects, this overfitting can inflate performance estimates for the test set. On the other hand, with the nested approach, overfitting to training samples will reduce performance for the independent test samples, resulting in more conservative (even pessimistic) performance estimates. Hence, the two approaches are useful in tandem for observing the range of estimates.
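The difference between the two frameworks can be illustrated with the short sketch below. To keep it self-contained, it departs from the methods of this study in two labelled ways: features are ranked by the absolute difference in class means (not mRMR), and a nearest-centroid rule stands in for the linear SVM. The pure-noise dataset is simulated and hypothetical.

```python
import random


def select_features(X, y, k):
    """Top-k feature indices by absolute difference in class means (stand-in for mRMR)."""
    def gap(j):
        pos = [row[j] for row, lab in zip(X, y) if lab == 1]
        neg = [row[j] for row, lab in zip(X, y) if lab == 0]
        return abs(sum(pos) / len(pos) - sum(neg) / len(neg))

    return sorted(range(len(X[0])), key=gap, reverse=True)[:k]


def centroid_predict(X_train, y_train, x_test, cols):
    """Nearest-centroid prediction on the selected columns (stand-in for the SVM)."""
    def centroid(label):
        rows = [r for r, lab in zip(X_train, y_train) if lab == label]
        return [sum(r[j] for r in rows) / len(rows) for j in cols]

    def sq_dist(center):
        return sum((x_test[j] - cj) ** 2 for j, cj in zip(cols, center))

    return 1 if sq_dist(centroid(1)) < sq_dist(centroid(0)) else 0


def loocv_accuracy(X, y, k, nested):
    """LOOCV accuracy with feature selection inside (nested) or outside (non-nested) the loop."""
    fixed_cols = None if nested else select_features(X, y, k)  # uses ALL samples
    correct = 0
    for i in range(len(X)):
        X_train, y_train = X[:i] + X[i + 1:], y[:i] + y[i + 1:]
        # Nested: the held-out sample never influences which features are chosen.
        cols = select_features(X_train, y_train, k) if nested else fixed_cols
        correct += centroid_predict(X_train, y_train, X[i], cols) == y[i]
    return correct / len(X)


# Simulated pure-noise data: 20 subjects, 200 features, no true class signal.
rng = random.Random(0)
y_noise = [0] * 10 + [1] * 10
X_noise = [[rng.gauss(0.0, 1.0) for _ in range(200)] for _ in y_noise]

acc_non_nested = loocv_accuracy(X_noise, y_noise, k=5, nested=False)
acc_nested = loocv_accuracy(X_noise, y_noise, k=5, nested=True)
```

With features that carry no class information, any accuracy above chance in the non-nested estimate reflects selection on the test subjects, whereas the nested estimate is expected to stay near chance.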

Robust biomarker identification
For the final experimental aim, a robust consensus-based procedure was used to identify possible NSCLC biomarkers from the VOC and breathprint feature sets. Inspired by the consensus nested cross-validation (cnCV) method [30], the predictors that were selected most frequently during the learning curve procedure were considered 'consensus features'. In brief, the identification of consensus features, embedded into the learning curve performance estimation, was performed as follows:
(a) Select a random subset of N samples from the dataset.
(b) For the given subset, apply the CV and nested CV frameworks to compute optimal feature sets and assess their classification performance. For the non-nested framework, there will be a single feature set. For the nested framework, there will be N feature sets.
[…] different random subsets of samples.
(f) For each sample size, identify the features that appeared in at least T% of iterations from the non-nested CV and fold-consensus nested CV feature sets. This will create two 'iteration-consensus' feature sets.
(g) Identify the features from the iteration-consensus feature sets that were selected for at least T% of sample sizes to establish the final sets of potential biomarkers.
In the present investigation, a threshold T of 68% was selected, as the empirical rule states that 68% of the data observed following a normal distribution lies within one standard deviation of the mean. Further, because high discord was expected for selected feature sets in the smallest tested sample sizes, sample sizes lower than 40 were disregarded when determining the sample size consensus in Step (g). This same procedure was used for both the VOC and spectral breathprint feature sets.
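The two thresholding steps, (f) and (g), can be sketched as follows. The sketch takes the per-iteration feature sets as given and does not reproduce the fold-level consensus of the nested framework. The data structure (a dictionary keyed by sample size), the function names, and the toy feature names are hypothetical.

```python
from collections import Counter


def consensus(feature_sets, threshold=0.68):
    """Features that appear in at least `threshold` of the given feature sets."""
    counts = Counter(f for fs in feature_sets for f in set(fs))
    n = len(feature_sets)
    return {f for f, c in counts.items() if c / n >= threshold}


def potential_biomarkers(selected, threshold=0.68, min_size=40):
    """Apply iteration consensus per sample size, then consensus across sample sizes.

    selected: dict mapping a sample size to a list of feature sets,
    one per learning curve iteration. Sample sizes below `min_size`
    are disregarded.
    """
    iteration_consensus = [
        consensus(sets, threshold)
        for size, sets in sorted(selected.items())
        if size >= min_size
    ]
    return consensus(iteration_consensus, threshold)


# Hypothetical selections: three iterations at each of four sample sizes.
selected = {
    30: [{"x"}, {"x"}, {"x"}],                      # ignored (below 40)
    40: [{"a", "b"}, {"a", "b"}, {"a", "b", "c"}],
    50: [{"a", "b"}, {"a"}, {"a", "b"}],
    60: [{"a", "b"}, {"a", "b"}, {"a", "b"}],
}
biomarkers = potential_biomarkers(selected)
```

In this toy case only feature `a` survives: feature `b` reaches the threshold at two of the three retained sample sizes (67%), which falls just short of 68%. With the ten iterations used in this study, a 68% threshold corresponds to selection in at least seven iterations.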
The potential biomarkers were lastly assessed through statistical means, to (1) compare each consensus feature between the NSCLC and control groups, and (2) examine each feature's association with various confounding factors: sex, smoking habits, age, CAT score, and food intake prior to collection. The second was performed by dividing the participants into binary subgroups so that two-group comparisons could be performed (e.g. male vs. female, younger vs. older, low CAT score vs. high CAT score). Due to the zero-inflated distributions for the VOC features, a robust two-part Wilcoxon test was implemented for each comparison as described by Gleiss et al [31]. Two tests were performed: (1) a χ² test to compare the proportions of zero-concentrations between groups (i.e. the number of subjects for which the compound was absent or unable to be detected), and (2) a nonparametric Wilcoxon rank sum test to compare the non-zero values between groups. The two-part test statistic $\chi^2_{(2)}$, which follows a χ² distribution with two degrees of freedom, was given by (4), where $\chi^2_{(1)}$ is the continuity-corrected test statistic from the χ² test and $U$ is the continuity-corrected and normalized test statistic from the Wilcoxon rank sum test [32]:

$$\chi^2_{(2)} = \chi^2_{(1)} + U^2 \qquad (4)$$

In cases where one group consisted of only zero values, p-values were derived from a χ² distribution with one degree of freedom using only the $\chi^2_{(1)}$ test statistic. For the breathprint features, which were count variables, standard Wilcoxon rank sum testing was used for all comparisons.
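A minimal sketch of the two-part test is given below. It assumes that the two-part statistic is the sum of the continuity-corrected χ² statistic and the squared normalized rank sum statistic. It also makes simplifying assumptions that may differ from the cited implementation: a Yates-corrected 2×2 χ² statistic, and a normal approximation for the rank sum statistic with average ranks for ties but no tie correction of the variance. The example concentrations are hypothetical.

```python
import math


def chi2_2x2_corrected(a, b, c, d):
    """Yates continuity-corrected chi-square statistic for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0
    num = max(abs(a * d - b * c) - n / 2.0, 0.0)
    return n * num * num / denom


def rank_sum_z(x, y):
    """Continuity-corrected, normalized Wilcoxon rank sum statistic (normal approximation)."""
    n1, n2 = len(x), len(y)
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    rank_sum_x, i = 0.0, 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0  # average of ranks i+1 .. j for tied values
        rank_sum_x += avg_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    expected = n1 * (n1 + n2 + 1) / 2.0
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    diff = rank_sum_x - expected
    return math.copysign(max(abs(diff) - 0.5, 0.0) / sd, diff)


def two_part_test(x, y):
    """Return (statistic, p-value, degrees of freedom) for two zero-inflated samples."""
    x_pos = [v for v in x if v != 0]
    y_pos = [v for v in y if v != 0]
    chi2_1 = chi2_2x2_corrected(
        len(x) - len(x_pos), len(x_pos), len(y) - len(y_pos), len(y_pos)
    )
    if not x_pos or not y_pos:
        # One group has only zeros: use the chi-square part alone (1 df).
        return chi2_1, math.erfc(math.sqrt(chi2_1 / 2.0)), 1
    u = rank_sum_z(x_pos, y_pos)
    stat = chi2_1 + u * u
    # Survival function of a chi-square distribution with 2 df is exp(-x / 2).
    return stat, math.exp(-stat / 2.0), 2


# Hypothetical log-concentrations with many zeros (compound not detected).
group_a = [0, 0, 0, 1.2, 1.5, 2.1, 2.4, 3.0]
group_b = [0, 0, 0, 0, 0, 0, 0.4, 0.9]
stat, p_value, dof = two_part_test(group_a, group_b)
```

The closed-form p-values avoid any dependency on a statistics library; in practice, an established implementation of the χ² and rank sum tests would be preferable.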

Results
The learning curves (based on sample sizes N = 10, 20, . . ., 150, 158) for the VOCs and the spectral breathprints are shown in figure 3. The VOC-based model performance estimated at the maximum sample size (N = 158) was 85.44% accuracy (66.13% sensitivity, 97.92% specificity) using the non-nested framework, and 65.19% accuracy (51.61% sensitivity, 73.96% specificity) using the nested framework. The non-nested model was based on a fixed feature set of 19 VOC predictors while the nested models used an average of around 27 predictors (range 17–50) across CV folds. For the spectral breathprint-based models, the final performance estimates were 86.08% accuracy (82.26% sensitivity, 88.54% specificity) with the non-nested framework and 71.52% accuracy (58.06% sensitivity, 80.21% specificity) with the nested framework. The non-nested model used 24 breathprint predictors and the nested model used an average of around 28 predictors. It should be noted that the 95% confidence intervals generally narrow as sample size increases because there is more overlap in subjects across the ten iterations. At the maximum sample size, only one 'subset' is possible and therefore only one iteration was performed.
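As a quick consistency check on the estimates above, the reported accuracies can be recovered from the reported sensitivities and specificities together with the class sizes (62 NSCLC subjects and 96 controls). The short sketch below is purely arithmetic; the function and dictionary names are hypothetical.

```python
def accuracy_from_rates(sensitivity, specificity, n_pos=62, n_neg=96):
    """Overall accuracy implied by sensitivity and specificity for given class sizes."""
    true_pos = round(sensitivity * n_pos)  # correctly classified NSCLC subjects
    true_neg = round(specificity * n_neg)  # correctly classified controls
    return (true_pos + true_neg) / (n_pos + n_neg)


# (sensitivity, specificity) pairs reported at N = 158.
reported = {
    "VOC, non-nested": (0.6613, 0.9792),
    "VOC, nested": (0.5161, 0.7396),
    "breathprint, non-nested": (0.8226, 0.8854),
    "breathprint, nested": (0.5806, 0.8021),
}

implied = {
    name: round(100 * accuracy_from_rates(sens, spec), 2)
    for name, (sens, spec) in reported.items()
}
```

For example, 66.13% sensitivity and 97.92% specificity correspond to 41 of 62 NSCLC subjects and 94 of 96 controls, giving 135/158 = 85.44% accuracy.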
The potential biomarkers identified during the consensus procedure for each feature type (VOC-based or spectral breathprint-based) and validation framework (non-nested or nested CV) are shown in figure 4. Using the VOC approach, the features from the non-nested CV procedure that fulfilled the consensus criteria were the log-transformed concentrations for dimethyl sulfide (identified at a desorption temperature of 75 °C), isopropanol (75 °C), and butyric acid (150 °C). The nested CV results corroborated both the dimethyl sulfide (75 °C) and isopropanol (75 °C) findings. For the breathprint approach, both nested and non-nested techniques yielded the same two consensus features: the frequency of LBP 31 (in binary, 00011111) in the raw 75 °C spectra and the frequency of LBP 28 (00011100) in the raw 225 °C spectra. It should be noted that although these consensus features were found to be the most prominent among all the developed models, additional features contributed to each of the learning curve estimates in figure 3. Broader feature sets were necessary for capturing the subtle, secondary patterns that enabled effective discrimination of the two classes.
Table 2 presents the results of the comparisons between the NSCLC and control cohorts for each of the potential biomarkers. Two-part Wilcoxon tests were used for the VOC features and standard Wilcoxon rank sum tests were used for the breathprint features. In each case, statistically significant differences were found (p < 0.05). For the VOC features, the two-part tests indicated consonant increases (i.e. increases in both the proportion of non-zero values and in the concentrations of these non-zero values) of dimethyl sulfide (75 °C) and isopropanol (75 °C) in the NSCLC group. Butyric acid (150 °C), which was only present in the control group, was found to be significantly decreased in proportion for the NSCLC group. For the breathprint features, raw LBP 31 (75 °C) was significantly increased and raw LBP 28 (225 °C) was significantly decreased in the NSCLC cohort.
Comparisons were also performed for various subgroups in table 3 to test each feature's association with possible confounders. Few significant differences were found between subgroups. A dissonant difference was found for isopropanol (75 °C) due to age, with an increase in instances but lower concentrations in the older group compared to the younger group (p = 0.04). Butyric acid (150 °C) exhibited a consonant difference due to food intake, with an increase in instances and higher concentrations for participants that had eaten within 3 h prior to collection. For the spectral breathprint features, the raw LBP 28 (225 °C) feature was found to be decreased (p = 0.003) in former smokers compared to never-smokers as well as in subjects with high CAT scores compared to low CAT scores (p = 0.02), and the frequency of raw LBP 31 (75 °C) was increased for active smokers compared to non-smokers (p = 0.01).

Note: NSJ represents the number of subjects with a non-zero value for the feature, and the mean (µ) and standard deviation (σ) include only those non-zero values. Features with zero values represent the absence of a compound in a spectrum (for the VOC approach) or the absence of a given LBP in a spectrum or spectral derivative (for the breathprint approach).

Screening using cavity ring-down spectroscopy
The first purpose of this work was to determine the utility of CRDS, an LAS technique, for detecting NSCLC through the analysis of exhaled breath profiles. The non-nested and nested CV accuracy estimates were 85.44% and 65.19% for the VOC features and 86.08% and 71.52% for the breathprint features, respectively. The disparity in non-nested and nested estimates can be attributed mainly to overfitting effects, although some pessimism is inherent for the nested estimates due to the reduced information during feature selection [33]. By using all samples in model development, however, as with the non-nested framework, the models were able to find the most convenient noise-based features in addition to truly relevant patterns. This was especially evident in the smallest sample sizes (N < 40) where perfect or near-perfect accuracies were achieved with this method. Eventually, with sufficient sample size, the two types of estimates should converge to a value within the indicated performance range (figure 3), reflecting the true underlying quality of the features.
Based on the performance estimate ranges for the current classification models, the discrimination ability of CRDS breath profiles is at least on par with many other screening technologies. In a large population screening study, chest CT detected lung cancer with 55% sensitivity and 95% specificity [34]. For breath-based systems, studies employing a variety of spectrometry and e-nose technologies from 1985 to 2020 have reported accuracies ranging from 60% to 100% (sensitivities from 50% to 100% and specificities from 12% to 100%) in discriminating lung cancer subjects (some including small-cell subtypes, others NSCLC only) from non-cancer controls [11,35]. It is important, however, to interpret and compare these reported classification performance estimates with caution. In addition to study designs, sample collection protocols, and measurement techniques, which vary greatly across studies, the employed statistical and machine learning schemes are critical to understanding the performance estimates. In fact, many studies reporting very high accuracies employed non-nested CV or other validation frameworks that can yield overly optimistic estimates [11]. Lack of correction for overoptimism is a key methodological shortcoming in many breath-based lung cancer detection studies. For instance, Westhoff et al [36] reported perfect LOOCV accuracy in distinguishing 32 lung cancer patients from 54 healthy controls using ion mobility spectrometry (IMS) peaks, but they used information from both training and test subjects for tuning the optimal feature set. Bajtarevic et al [37] similarly used all available subjects for selecting VOC predictors to develop their decision rule, which achieved 80% sensitivity and 100% specificity for a dataset of 65 lung cancer and 31 control subjects. In contrast, few studies employ nested CV or other techniques that prevent information leakage, such as hold-out validation. One such study is that of Mazzone et al [38], who used an independent 30% subset of their 49 NSCLC and 94 non-cancer control subjects for validation, achieving 73.3% sensitivity and 72.4% specificity with colorimetric sensor array measurements. Given that sample size and model complexity (associated with the adopted classification parameters and the number of selected features) can also exacerbate the contrast between estimates from different validation frameworks, direct comparisons across reported performance estimates are not possible.
Despite the abundance of research in exhaled breath analysis for lung cancer detection using spectroscopic and sensor-based techniques, few studies have used LAS for this purpose. Skeldon et al [39] used tunable diode laser absorption spectroscopy (TDLAS) for measuring ethane, an accepted marker of oxidative stress, in the exhaled breath of 12 lung cancer patients and 12 matched controls. Notably, they did not find a significant difference between the two cohorts, concluding that a singular non-specific marker such as ethane is unlikely to provide sufficient evidence of a particular pathological condition such as lung cancer. Mitrayana et al [40] used laser photoacoustic spectroscopy (LPAS) for comparing acetone in the breath of 11 lung cancer patients and two control populations, consisting of 10 healthy volunteers and 9 patients with other lung diseases. They found a significant increase in acetone for lung cancer patients compared to the reference groups, but recommended further analysis because measurements for all populations fell within the expected concentration range of acetone in normal breath [41]. Both studies were limited by very low sample sizes and targeted approaches: since biomarkers may be associated with many diseases, and moreover, a disease may be characterized by several biomarkers, an individual's entire composite breath profile may be necessary for lung cancer detection.
Note: NSJ represents the number of subjects with a non-zero value for the feature, and the reported mean (µ) and standard deviation (σ) include only those non-zero values.
The CRDS system presented in this work is novel to the field of lung cancer detection. As an LAS technique, it provides low-cost, quick, and accurate breath profiling that can be performed by non-experts. However, unlike other LAS techniques such as TDLAS and LPAS, CRDS does not require frequent calibration, and its measurements are unaffected by irregularities in laser intensity [16]. The considerable path length of the light in the cavity (approximately one kilometer in this work) also enables comparatively higher sensitivity than TDLAS and LPAS [17]. For a breath sample, CRDS therefore provides an ultra-sensitive, highly reproducible set of measurements. Moreover, unlike Skeldon et al [39] and Mitrayana et al [40], the CRDS system in this work measured absorptions for a wide range of infrared wavelengths. By broadening the investigation to include the entire VOC composition of a sample, rather than targeting individual presumed biomarkers, the present CRDS analysis was able to better capture distinguishing signals.
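The insensitivity to laser intensity follows from the general principle of CRDS: absorption is inferred from the decay rate of light in the cavity, not from its amplitude. The sketch below uses the textbook relations I(t) = I0 exp(-t/τ) and α = (1/c)(1/τ - 1/τ0). The ring-down times are hypothetical and are chosen so that c·τ0 equals one kilometer, which assumes that the quoted path length corresponds to one ring-down time constant.

```python
import math

C = 299_792_458.0  # speed of light in vacuum, m/s

def effective_path_length(tau_s):
    """Distance light travels during one ring-down time constant, in metres."""
    return C * tau_s

def absorption_coefficient(tau_s, tau_empty_s):
    """Absorption coefficient (1/m) from ring-down times with and without absorber."""
    return (1.0 / tau_s - 1.0 / tau_empty_s) / C

def fit_ring_down_time(times_s, intensities):
    """Estimate tau from a log-linear least-squares fit of I(t) = I0 * exp(-t / tau)."""
    logs = [math.log(v) for v in intensities]
    n = len(times_s)
    t_mean = sum(times_s) / n
    l_mean = sum(logs) / n
    slope = (sum((t - t_mean) * (l - l_mean) for t, l in zip(times_s, logs))
             / sum((t - t_mean) ** 2 for t in times_s))
    return -1.0 / slope

tau_empty = 1000.0 / C        # assumed empty-cavity ring-down time (1 km effective path)
tau_sample = 0.9 * tau_empty  # hypothetical shorter ring-down with an absorbing sample

times = [i * 1e-7 for i in range(50)]
weak = [1.0 * math.exp(-t / tau_sample) for t in times]    # low initial intensity
strong = [5.0 * math.exp(-t / tau_sample) for t in times]  # five times brighter pulse

tau_weak = fit_ring_down_time(times, weak)
tau_strong = fit_ring_down_time(times, strong)
alpha = absorption_coefficient(tau_weak, tau_empty)
print(f"tau (weak) = {tau_weak:.3e} s, tau (strong) = {tau_strong:.3e} s, alpha = {alpha:.3e} 1/m")
```

Both simulated pulses yield the same fitted ring-down time, and hence the same absorption coefficient, because the initial intensity only shifts the intercept of the log-linear fit.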

VOC features vs. spectral breathprint features
The second purpose of this study was to compare models trained with VOC features to models trained with spectral breathprint features. While the non-nested CV performance estimates were very similar for both sets of features (within 3% difference in mean accuracy at all sample sizes), the more conservative estimates from the nested framework were generally higher for the breathprint features (figure 3). In fact, using all 158 samples, an improvement in nested LOOCV accuracy of over 6% was observed compared to the VOC-based model.
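The relationship between the reported accuracy, sensitivity, and specificity values can be checked with simple arithmetic. In the sketch below, the cohort sizes (62 NSCLC, 96 controls) are those reported for the 158 samples, whereas the true positive and true negative counts are not reported directly: they are inferred here as the integer counts that reproduce the percentages given in the abstract, and should be read as an assumption.

```python
def rates(tp, tn, n_pos, n_neg):
    """Return (sensitivity, specificity, accuracy) as percentages."""
    sensitivity = 100.0 * tp / n_pos
    specificity = 100.0 * tn / n_neg
    accuracy = 100.0 * (tp + tn) / (n_pos + n_neg)
    return sensitivity, specificity, accuracy

N_NSCLC, N_CONTROL = 62, 96  # cohort sizes for the 158 samples

# (true positives, true negatives) inferred from the reported percentages.
inferred_counts = {
    ("VOC", "non-nested"): (41, 94),
    ("VOC", "nested"): (32, 71),
    ("breathprint", "non-nested"): (51, 85),
    ("breathprint", "nested"): (36, 77),
}

results = {key: rates(tp, tn, N_NSCLC, N_CONTROL) for key, (tp, tn) in inferred_counts.items()}
for (features, framework), (sens, spec, acc) in results.items():
    print(f"{features:12s} {framework:10s} sens={sens:.2f}% spec={spec:.2f}% acc={acc:.2f}%")

nested_gain = results[("breathprint", "nested")][2] - results[("VOC", "nested")][2]
print(f"nested accuracy gain for breathprints: {nested_gain:.2f} percentage points")
```

Under these inferred counts, the nested accuracy gain corresponds to ten additional correctly classified subjects out of 158.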
The conventional VOC-based approach to classification is complementary to spectrometry techniques that can quantify VOCs in a sample with high accuracy (such as GC-MS). Through statistical and machine learning methods, the most relevant VOCs are identified as potential biomarkers and used as predictors for detecting the disease in future samples. The main advantage of this approach is its interpretability, as it permits investigation into the metabolic processes that produced the specific VOC biomarkers. Further, if a list of reliable, consistent VOC biomarkers were established for the disease, then these could be used to develop effective screening systems with a number of different technologies, as the biomarkers would not be specific to any one platform. However, as the search for these biomarkers in recent years has yet to yield any VOCs of clinical relevance, the breathprinting approach is a promising alternative.
The breathprinting technique, most often associated with cross-reactive sensor arrays that cannot discern specific VOC constituents, aims to identify patterns associated with disease from the sensor response. The flexibility of this approach offers some advantages over standard VOC identification. It is not reasonable to expect a homogeneous breath profile across all individuals with lung cancer; in addition to environmental confounders and individual-specific differences, different lung cancer cell mutations will ensure that all breath profiles are unique to a degree [42]. As the complex relationships, origins and metabolic pathways for VOCs in exhaled breath are still not well understood [43], fixating on specific VOCs may be a limiting approach to detecting disease. With breathprints, however, machine learning techniques can be harnessed to uncover subtle differences in breath profiles and learn different manifestations of lung cancer. Advanced feature extraction and transformation techniques may be able to uncover complex patterns that the VOC identification approach cannot. Ultimately, classification does not require knowledge of the specific VOCs in a sample.
Based on the improved performance observed with the spectral breathprinting technique, the authors endorse this approach for lung cancer detection. Though more abstract than the VOC features, the histogram features from the 1D-LBP representations of the spectra and their derivatives were able to more effectively characterize the health of the subjects than the VOC concentrations. Further, because both sets of features are different representations of the same original data, and fitting VOCs to the spectra is not a trivial undertaking, it is computationally expedient to bypass the VOC identification step and focus on drawing out the most useful patterns from the breathprints themselves. An advantage of CRDS is that it permits VOC extraction when necessary, so further analysis and interpretation can be performed in addition to the breathprinting approach to examine the samples from a biological perspective. With non-selective e-nose technologies, this type of investigation would require a secondary technology such as GC-MS [44].

Robust biomarker identification
The third purpose of this study was to propose a robust breath biomarker identification method for VOCs and spectral breathprints, given the lack of reproducibility in previous works. The 68% consensus criterion implemented in this work, which filters out noise-based, irrelevant features, may help to reduce conflicting results in future studies. Based on the proposed method, two VOCs were identified as potential NSCLC breath biomarkers from both the nested and non-nested CV learning curves: (1) isopropanol and (2) dimethyl sulfide (figure 4).
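The consensus idea can be sketched as a two-stage vote: a feature is first required to appear in at least 68% of the feature sets obtained across iterations at a given sample size, and then in at least 68% of the per-size consensus sets, excluding sample sizes below 40. The function names and the toy selections below are hypothetical, and the additional agreement across nested CV folds is omitted for brevity.

```python
from collections import Counter

def consensus_features(feature_sets, threshold=0.68):
    """Return features selected in at least `threshold` of the given feature sets."""
    counts = Counter(f for selected in feature_sets for f in set(selected))
    n_sets = len(feature_sets)
    return {f for f, c in counts.items() if c / n_sets >= threshold}

def consensus_across_sizes(sets_by_size, threshold=0.68, min_size=40):
    """Two-stage consensus: across iterations at each sample size, then across sizes."""
    per_size = [
        consensus_features(iteration_sets, threshold)
        for size, iteration_sets in sorted(sets_by_size.items())
        if size >= min_size
    ]
    return consensus_features(per_size, threshold)

# Hypothetical selections: four sample sizes, four iterations each.
sets_by_size = {
    20: [{"noise_a"}, {"noise_b"}, {"noise_a"}, {"noise_c"}],
    40: [{"isopropanol", "noise_a"}, {"isopropanol", "dms"}, {"isopropanol", "dms"}, {"dms"}],
    80: [{"isopropanol", "dms"}, {"isopropanol", "dms"}, {"isopropanol"}, {"dms", "noise_b"}],
    158: [{"isopropanol", "dms"}, {"isopropanol", "dms"}, {"isopropanol", "dms"}, {"dms"}],
}
selected = consensus_across_sizes(sets_by_size)
print(sorted(selected))
```

In this toy example, features that appear only sporadically (the "noise" entries) fail the first vote, so they never reach the vote across sample sizes.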
Isopropanol (also called isopropyl alcohol or 2-propanol) is a common ingredient in regular household items such as disinfectants and hand sanitizers. The results of the present investigation are in support of previous studies [9, 45–48] which suggest that exhaled isopropanol concentrations are significantly higher (p < 0.05) in patients with lung cancer than in controls (table 2). While some suggest that this may be caused by the use of disinfectants in hospital rooms [49], it should be noted that both lung cancer and control subjects completed the sample collection in the same hospital rooms (one room at each of the three sites) in the present investigation. Further, isopropanol has been repeatedly identified as a potential lung cancer biomarker in previous investigations [9, 44–48]. Among these, the impact of isopropanol on discrimination has been identified as very high. For example, using solid phase microextraction (SPME) and gas chromatography-time of flight-mass spectrometry (GC-TOF/MS), isopropanol had the highest discriminant ability among 20 potential VOC breath biomarkers, as determined by linear discriminant analysis (LDA) [45]. Using similar SPME-GC-MS technology, but a different multivariate analysis (i.e. principal component analysis or PCA), the first three principal components (PC1-PC3) showed significant differences between 31 lung cancer patients, 31 smokers, and 31 healthy controls, and isopropanol and 1-propanol were the most positively correlated substances on PC3 [46]. With a combination of the predictors isopropanol, formaldehyde, and age with quadratic discriminant analysis (QDA), an accuracy of 96% (sensitivity 54%, specificity 99%) was achieved for distinguishing 17 lung cancer patients and 170 healthy controls using proton transfer reaction mass spectrometry (PTR-MS) [47]. Review studies [6,7] also determined propanol to be the most frequently emerging biomarker of lung cancer, and in the human body, propanol is believed to be mostly isopropanol (or 2-propanol) [41].
Dimethyl sulfide (DMS) in breath is most often associated with halitosis [50]. The results of the present investigation indicate an increase in dimethyl sulfide (p < 0.05) in the exhaled breath of lung cancer patients as compared to non-cancer controls (table 2), which is in support of previous findings [48,49]. While Kischkel et al [46] reported that the concentration of dimethyl sulfide was lowest in lung cancer patients, they posited that their finding may be related to dental status rather than to cancer-specific effects. Despite this, and similar to isopropanol, dimethyl sulfide has been identified as a key VOC breath biomarker for discrimination between lung cancer patients and healthy controls using decision tree classification [48].
There are some interesting considerations regarding isopropanol and dimethyl sulfide. First, both compounds have been identified as breath biomarkers for patients with cystic fibrosis, a progressive, serious genetic disease that causes breathing abnormalities including lung infections, a persistent cough, and shortness of breath [41,51]. Second, both compounds have shown some level of association with smoking. A previous work found that dimethyl sulfide was one of the most important compounds for discriminating healthy subjects with different smoking habits [49]. Additionally, in the breath of active smokers, one study [49] found that the concentration of isopropanol was higher while another [46] found that the concentration was lower when compared to non-smoking controls and lung cancer patients. However, it should be noted that neither isopropanol nor dimethyl sulfide was found to differ significantly due to smoking habits in this work, and neither has emerged as an established smoking-related VOC in the literature [6].
From the non-nested CV learning curve, one additional VOC biomarker was identified, butyric acid, which was found in the non-cancer control subjects but not in NSCLC patients (table 2). To our knowledge, this VOC has not been previously identified as a potential lung cancer biomarker [6,7]. It has, however, been found to be decreased in oesophagogastric cancer patients compared to healthy subjects [52]. It is interesting to note that this fatty acid can also increase in breath after ingestion of a meal in healthy subjects [53]. In the present investigation, a significant increase in butyric acid was found for subjects who had eaten within three hours of the collection time. Moreover, all subjects who presented with butyric acid had eaten less than 14 h prior to data collection (most more recently than 6 h). The finding may therefore be at least partially related to food intake in the control group. It should also be reiterated that butyric acid was not identified as a biomarker using the proposed consensus method with the nested CV framework. It is possible that the stricter protocol for nested CV, which imposes a 68% agreement across CV folds in addition to the consensus across iterations and sample sizes, was able to filter out butyric acid as a noise-based feature. The nested CV protocol is especially effective at minimizing the effects of outlier subjects: after all, butyric acid was only found in a total of ten subjects.
In addition to smoking history and food/diet, age and gender have been recognized as important confounding factors in previous studies on the concentration variation of VOCs [6,54]. In this work, no differences were found between females and males for any of the identified consensus biomarkers, and only isopropanol was found to show some association with age (table 3). Although the subgroup analysis did not provide strong evidence that confounders had a significant impact on classification, it is possible that systematic differences between cohorts may have biased performance. The statistical comparisons in table 1 indicated that the control cohort had fewer smokers, lower CAT scores, and was younger on average than the NSCLC cohort. Further, though non-cancer diseases and conditions were permitted in both groups, the control group had significantly fewer instances of COPD and pneumonia, and many control participants could be considered healthy. Importantly, some mismatch was tolerated between groups to mitigate the risk of undiagnosed lung cancer in the control group. Some factors, including diet and medication use, were also intentionally left unconstrained for the subjects to ensure a realistic representation of the lung cancer population. Regardless, future works should take additional steps to (1) explore and minimize the effects of confounders, and (2) more accurately represent the intended screening population. Primarily, high-risk controls should be recruited (e.g. over 55 years old, 30+ pack-year smoking history) to better match the NSCLC cohort, and medical conditions that present similarly to lung cancer should be more prominently represented. Furthermore, additional factors that were not explored in this work, such as caffeine intake, exercise, and medication use, should be constrained or incorporated into future analyses to limit potential biases. When sample size permits, subgroup-specific models (for example, one model for smokers, one for non-smokers) may also be a useful tool for addressing confounders that cannot be reasonably controlled. Lastly, although the same collection sites were used for both groups to limit bias from environmental contaminants, future works should also consider comparing breath spectra to spectra for the surrounding room air. This could help to enhance the endogenous compounds of interest and reduce irrelevant differences related to fluctuations in environmental VOCs.
For the spectral breathprints (VOC patterns), two consensus features were identified (table 2). Both are features based on the 1D-LBP of raw spectra. This feature extraction method has shown its discriminative power in many biomedical signal and image processing applications, with the ability to analyze data in real-time applications (due to its computational simplicity) [25]. To our knowledge, though, this work is the first to apply 1D-LBP for extracting useful information from breathprints.
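For readers unfamiliar with the transformation, a minimal sketch of one common 1D-LBP formulation is given below: each point in the signal is compared with its neighbours on either side, the comparison results are packed into a binary code, and the normalised histogram of codes serves as the feature vector. The neighbourhood radius, the bit ordering, and the example values are assumptions for illustration and may differ from the configuration used in this work.

```python
from collections import Counter

def lbp_1d_codes(signal, radius=4):
    """Compute a 1D-LBP code for every point that has `radius` neighbours on each side."""
    codes = []
    for i in range(radius, len(signal) - radius):
        neighbours = signal[i - radius:i] + signal[i + 1:i + radius + 1]
        code = 0
        for bit, value in enumerate(neighbours):
            if value >= signal[i]:  # threshold each neighbour against the centre point
                code |= 1 << bit
        codes.append(code)
    return codes

def lbp_1d_histogram(signal, radius=4):
    """Normalised frequency of each of the 2**(2*radius) possible codes."""
    codes = lbp_1d_codes(signal, radius)
    counts = Counter(codes)
    n_bins = 2 ** (2 * radius)
    return [counts.get(b, 0) / len(codes) for b in range(n_bins)]

# Hypothetical absorption values at ten wavelengths.
spectrum = [0.2, 0.5, 0.3, 0.9, 0.4, 0.1, 0.6, 0.7, 0.3, 0.2]
codes = lbp_1d_codes(spectrum, radius=2)
histogram = lbp_1d_histogram(spectrum, radius=2)
print(codes)
```

Because each code depends only on the ordering of values within a small window, the histogram describes local shape (peaks, valleys, slopes) rather than absolute absorption levels.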
Ultimately, the present learning curves (figure 3) suggest that more samples are needed to fully exploit the potential of this approach, as the curves do not appear to have reached convergence by 158 samples. Due to the high complexity and variability associated with exhaled breath, a large sample size is necessary to represent the wide array of profiles that would be encountered during screening. The addition of samples would also permit the implementation of more advanced machine learning techniques, which should further improve classification performance. While the present investigation was restricted to the 1D-LBP features to limit overfitting to the available samples, several feature extraction methods could be potentially useful for breathomics analysis, including PCA and barcoding [55,56]. Wrapper-type feature selection techniques, such as recursive feature elimination (RFE) and genetic algorithm (GA) selection, may also be able to better locate biomarkers of interest among the extracted features. Alternatively, deep learning algorithms can take advantage of larger sample sizes to learn complex feature mappings, replacing the need for traditional feature extraction, feature selection, and classification techniques.
Additionally, although the present investigation employed a library of 152 common VOCs in exhaled breath, it has been acknowledged that more than 3000 different VOCs can be observed in human breath samples [57]. Future studies may therefore incorporate a larger spectral library. Similarly, the spectral regression procedure is limited by the number of measured wavelengths: with 73 measurements, a maximum of 73 VOCs can be fitted. Hardware advances for future works will expand the range and resolution of the CRDS spectra to augment VOC information from the source data. Lastly, it should be noted that small quantities of water present in the tubes prior to storage at −20 °C may have affected compound stability. Future works will investigate sample integrity during storage.
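The limit of 73 fitted VOCs can be motivated with a short linear-algebra argument, under the simplifying assumption that the spectral regression is an unconstrained linear least-squares fit. The symbols a (measured absorptions), K (matrix of library spectra), c (VOC coefficients) and ε (residual) are introduced here for illustration only.

```latex
\mathbf{a} = K\,\mathbf{c} + \boldsymbol{\varepsilon}, \qquad
\mathbf{a} \in \mathbb{R}^{m},\quad K \in \mathbb{R}^{m \times p},\quad
\mathbf{c} \in \mathbb{R}^{p}, \qquad m = 73

\hat{\mathbf{c}} = \arg\min_{\mathbf{c}} \lVert \mathbf{a} - K\mathbf{c} \rVert_2^2
\ \text{ is unique only if } \operatorname{rank}(K) = p,
\ \text{ which requires } p \le m = 73 .
```

Under this simplified model, at most 73 library spectra can be assigned unique coefficients for a single sample, so increasing the number of measured wavelengths m directly raises the number of VOCs that can be resolved.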

Conclusions
This pilot work demonstrated the feasibility of a novel CRDS breath profiling system for discriminating NSCLC patients from non-cancer controls. While CRDS outperforms prevailing technologies like mass spectroscopy and sensor-based systems in terms of practicality for real-world screening, a learning curve evaluation using two different validation frameworks showed that the classification models developed in this work were also on par with these systems in discrimination ability. Moreover, the use of a spectral breathprinting approach for classification was endorsed as it provided improved discrimination ability over a traditional VOC-based approach. With more samples, the flexibility of the breathprinting approach also permits further improvement in classification performance through advanced feature extraction and deep learning techniques, which can augment the information from the breath profiles to better capture underlying disease signatures. As CRDS technology permits the quantification of VOCs in a sample through spectral regression, a supplemental VOC-based analysis can be performed alongside the breathprint classification for interpretation when desired. Lastly, through the proposed consensus-based biomarker identification, isopropanol, dimethyl sulfide, and butyric acid were identified as potential VOC NSCLC biomarkers, along with two 1D-LBP spectral breathprint biomarkers. This work serves as an early-stage validation of infrared CRDS combined with machine learning for lung cancer screening.

Figure 2. Illustration of the two validation frameworks: non-nested LOOCV and nested LOOCV.
(c) Identify the features that were selected in at least T% of the nested CV feature sets, creating a single feature set representing the consensus across CV folds. For the given subset of samples, there will now be a single 'fold-consensus' nested CV feature set and a single non-nested CV feature set.
(d) Repeat Steps (b) and (c) for multiple sample sizes N by progressively adding random samples to the subset.
(e) Repeat Steps (b)-(d) for multiple iterations with

Figure 3. Averaged learning curves and 95% confidence intervals from ten iterations of non-nested and nested CV estimates for models built using (A) VOC and (B) spectral breathprint features.

Figure 4. Consensus VOC and breathprint features identified for (A), (B) the non-nested and (C), (D) the nested CV learning curves. Filled squares represent features (y-axis) that attained 68% consensus across the ten randomized subsampling iterations (and across nested LOOCV folds, as applicable for subfigures (C) and (D)) at the corresponding sample size (x-axis). Features that also attained 68% consensus across sample sizes, excluding sample sizes lower than 40, are highlighted in red.

Table 1. Subject demographics and clinical factors.
a t-test indicated a significant difference between groups (p < 0.05).
b Fisher's exact test indicated a significant difference between groups (p < 0.05).

Table 2. Tests of significance for the VOC features and spectral breathprint features identified through the 68% consensus procedure (see figure 4). A two-part Wilcoxon test was used to compare the VOC features and a Wilcoxon rank sum test was used to compare the spectral breathprint features between the control and NSCLC groups.

Table 3. Subgroup tests of significance for each of the potential VOC biomarkers (the logarithm of the linear fitting coefficients) and spectral biomarkers (the LBP frequencies).
a Two-part Wilcoxon test indicated a significant difference between groups (p < 0.05).
b Wilcoxon rank sum test indicated a significant difference between groups (p < 0.05).