Introduction

Lung cancer is the leading cause of cancer death worldwide [1]. The prognosis depends strongly on the tumour stage at the time of diagnosis [2]. Early diagnosis and treatment of lung cancer are therefore essential for reducing its mortality [3].

In the early stages, lung cancer is usually asymptomatic and often presents as a pulmonary nodule [2]. However, pulmonary nodules may have several causes [4]. They are a common incidental finding on imaging scans performed for various indications [4], as well as a common finding on lung cancer screening scans [5]. In both cases, most nodules are benign, but a small proportion represent lung cancer, usually at an early stage; hence, it is important to identify them correctly [5, 6].

The management of pulmonary nodules requires an initial evaluation of malignancy risk by a computed tomography (CT) scan of the thorax [7, 8]. The subsequent diagnostic work-up may include 2-deoxy-2-[18F]fluoro-D-glucose (2-[18F]FDG) positron emission tomography/computed tomography (PET/CT) for metabolic characterisation of solid or partially solid pulmonary nodules [7, 8]. The British Thoracic Society guidelines [8] recommend 2-[18F]FDG PET/CT for solid nodules ≥10 mm in diameter with a malignancy risk >10%. The Fleischner Society guidelines [7] consider 2-[18F]FDG PET/CT an option for the evaluation of solid nodules >8 mm in diameter and partially solid nodules with a solid component >8 mm in diameter.

The inclusion of 2-[18F]FDG PET/CT in the diagnostic work-up reduces the proportion of futile invasive diagnostic procedures [8], which can be a source of complications [4]. Moreover, PET/CT can anticipate a diagnosis of lung cancer compared with a CT surveillance strategy.

2-[18F]FDG PET/CT interpretation relies on the intensity of tracer uptake in the lesion (nodule-to-background contrast), through qualitative [9] or quantitative analysis [10]. Classical machine learning models supporting the differential diagnosis of pulmonary nodules have been developed from PET imaging features (SUVmax and/or radiomics), in some cases combined with CT imaging features (radiomic or visually extracted) or non-imaging features [9, 11,12,13,14,15,16,17,18,19,20]. Training machine learning models with radiomic features requires the extraction of dozens of handcrafted features and a laborious process of feature selection [21, 22]. In addition, radiomic features are sensitive to variations in image acquisition and reconstruction, segmentation, image processing and feature computation in multi-centre settings, so extensive standardisation and harmonisation are required to obtain reproducible models [21,22,23].

Deep learning is a subfield of machine learning, which in turn is part of artificial intelligence. Deep learning models can learn representations useful for the predictive task directly from labelled raw data [24], such as images, and therefore have the potential to improve the classification of pulmonary nodules.

Deep learning models have been successful in medical imaging, reaching performance comparable to that of physicians, or even outperforming them in specific tasks, in areas as diverse as dermatology [25,26,27], ophthalmology [28,29,30,31], pathological anatomy [32,33,34] and radiology [35,36,37,38].

The main objective of this research was to develop a convolutional neural network (CNN) model for classification of pulmonary nodules from an annotated dataset of 2-[18F]FDG positron emission tomography (PET) images. Secondarily, the hypothesis that the model outperforms the maximum standardised uptake value (SUVmax) was tested. Explanations for the decisions of the model were obtained by gradient-weighted class activation mapping (Grad-CAM).

Materials and Methods

This study was conducted in accordance with the Declaration of Helsinki and national regulations. It was approved by the University Hospital Centre of São João, Porto, Portugal, including approval by the institutional Ethics Committee and by the Responsible for Data Reuse. Informed consent was waived due to the retrospective nature of the research.

Image Dataset

A PET image dataset of pulmonary nodules was created. To ensure the quality of the data for modelling, the eligible population, the reference standard and the sampling procedure were first determined. Then, the data were collected and preprocessed.

Eligible Population

Participants belonged to the eligible population if they cumulatively met the following inclusion criteria:

  • One or more indeterminate solid pulmonary nodules with an average diameter greater than 8 mm and not exceeding 30 mm, according to the nodule definition of the British Thoracic Society guidelines [8]. The average diameter of a nodule is the mean of the long-axis and perpendicular short-axis diameters, both measured on the same orthogonal slice, as defined in the Fleischner Society guidelines [7];

  • The nodule detection was incidental or through screening;

  • 2-[18F]FDG PET/CT was performed to clarify the nodule(s), and the reconstructed images are available in digital format. The pathological status of the nodule(s) was unknown at the time of the PET/CT (indeterminate nodule);

  • The nodule was biopsied or excised and underwent histopathological or cytopathological examination, or otherwise completed an imaging follow-up period.

Participants meeting at least one of the following criteria were excluded:

  • History of lung cancer;

  • History of other cancers, except:

  • Non-melanoma skin cancer, low-risk localised prostate cancer, in situ cervical cancer, in situ breast cancer, or superficial bladder cancer, treated at least 6 months earlier.

Reference Standard

The reference standard for pulmonary nodule status was defined on the basis of the histopathological or cytopathological examination and/or the behaviour of the nodule during the CT follow-up period. It attributes one of two classes (benign or malignant) to the target feature, the status of each pulmonary nodule, as follows:

  1. A nodule is defined as malignant if it was biopsied or excised during the initial diagnostic work-up or during the follow-up period, and the histopathological or cytopathological examination showed a malignant neoplasm.

  2. A nodule is defined as benign if:

    a. It was excised and the histopathological examination showed benign pathology;

    b. It was biopsied, the biopsy was diagnostic and the histopathological examination showed benign pathology;

    c. It was neither excised nor biopsied, or it was biopsied but non-diagnostic, and during follow-up:

      i. The nodule disappeared;

      ii. The nodule decreased or kept the same size for at least 2 years of follow-up;

      iii. The nodule increased in size, was thereafter biopsied or excised, and the histology was benign;

      iv. The volume doubling time was >600 days with <25% change in volume over at least 1 year of follow-up.

A minimum of 2 years of imaging follow-up was established for solid nodules when the mean axial diameter was used for follow-up. When the follow-up period was between 1 and 2 years, the nodular volume was estimated from the diameters on three orthogonal axes. These follow-up criteria are based on the doubling time of malignant solid nodules and are recommended for pulmonary nodule management [7, 8].
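For context, the volume doubling time in criterion 2.c.iv is conventionally estimated under an exponential growth assumption, with the volume approximated from the mean diameter d by a sphere model. A standard formulation (the guidelines' exact computation is not reproduced here) is:

    \[
      \mathrm{VDT} = \frac{\Delta t \,\ln 2}{\ln\!\left(V_{2}/V_{1}\right)},
      \qquad
      V \approx \frac{\pi}{6}\, d^{3},
    \]

where V1 and V2 are the nodule volumes measured Δt days apart.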

Sampling

Every patient referred to the University Hospital Centre of São João who underwent a 2-[18F]FDG PET/CT scan between 2010 and 2019 was consecutively selected if he/she belonged to the defined population.

If a patient underwent more than one PET/CT scan, only the first was considered. If a patient had more than one nodule meeting the eligibility criteria, only the most suspicious one was included.

Among the 7130 PET/CT scan requests within the established time interval, the 2-[18F]FDG PET/CT scans aimed at clarifying the diagnosis of pulmonary nodules were selected. The eligibility criteria were then checked by consulting the medical records and the information from the histopathological/cytopathological examination, the standard-dose CT scan and the 2-[18F]FDG PET/CT scan. In the end, 113 participants were eligible for the PET image dataset.

Image Acquisition, Preprocessing, and Annotation

All patients underwent a PET/CT scan with a field of view between the skull base and the mid-thighs around 60 min after the 2-[18F]FDG intravenous injection. The exams were acquired on three different scanners (GE Discovery IQ 4R, GE Discovery LS/4 and Siemens Biograph 6). The PET images were reconstructed using the ordered subset expectation maximisation method. Attenuation correction of the PET data was performed with low-dose CT-derived attenuation maps.

Image preprocessing was performed in 3D Slicer 4.10.2 r28257 [39]. The PET and CT image files were imported and coregistered with rigid registration. Because the PET/CT scans were performed on different scanners, the PET volumes have different voxel sizes and anisotropic spacing. Therefore, the volumes were resampled to a common voxel size with isotropic voxels. The voxel side was set to 1.5 mm, smaller than the smallest voxel side of the three scanners. Linear interpolation was used for spatial resampling.

The nodule was visually identified in the coregistered PET/CT images, and a cubic region of interest was drawn and cropped to include the entire nodule. The centre of this subvolume coincides with the centre of the nodule, and its side length equals twice the maximum possible diameter of the nodule (60 mm × 60 mm × 60 mm). The cropped subvolume was saved in .nrrd format, and each subvolume containing a pulmonary nodule was annotated with the corresponding class of the target feature (benign or malignant).
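As an illustration only (the study performed this step interactively in 3D Slicer; the helper below is hypothetical), cropping a 60-mm cube at 1.5-mm isotropic spacing amounts to extracting a 40 × 40 × 40 voxel block centred on the nodule:

    # Hypothetical helper: crop a 40x40x40-voxel cube (60 mm at 1.5 mm
    # isotropic spacing) centred on a manually identified nodule centre.
    # Assumes the cube fits entirely within the resampled PET volume.
    crop_nodule_roi <- function(pet_volume, centre, side = 40L) {
      half <- side %/% 2L
      idx  <- lapply(centre, function(ci) (ci - half + 1L):(ci + half))
      pet_volume[idx[[1]], idx[[2]], idx[[3]], drop = FALSE]
    }

    # e.g. roi <- crop_nodule_roi(pet, centre = c(120L, 160L, 95L))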

Formulation of the Deep Learning Task

The supervised deep learning problem is a single-task, single-label, binary classification problem whose inputs are cubic PET regions of interest fed to a three-dimensional (3D) CNN.

Let X be a random variable representing an input, i.e. a PET image, which corresponds to a tensor whose axes 1, 2 and 3 define the shape of the volume-of-interest (40 × 40 × 40) and whose axis 4 is the number of channels, in this case one. Let Y be a random variable corresponding to the target. Let S be a training set with n pairs (X, Y) of independent and identically distributed samples drawn from the population. The learning problem then consists of using a CNN-based algorithm to choose, from the hypothesis space, the hypothesis or model that best approximates an unknown mapping function f: X → Y in the population, using the training set as a starting point. Model training is performed by finding the parameter configuration that minimises a loss function on the training set, the structural risk, a surrogate of the expected risk [40, 41]. However, minimising the risk on the training set is prone to overfitting, and a dissociation between the expected and structural risks can occur at any point during training [42]. An estimate of the expected risk on the validation set is more accurate but cannot be used to update model parameters, so it may be used to decide when to stop training [42].
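In symbols, and taking the binary cross-entropy adopted later as the surrogate loss L, the training objective is the usual empirical risk minimisation (a minimal formalisation consistent with [40, 41]):

    \[
      \hat{f} = \arg\min_{f \in \mathcal{H}} \frac{1}{n} \sum_{i=1}^{n} L\big(f(x_i), y_i\big),
      \qquad
      L(\hat{y}, y) = -\,y \log \hat{y} - (1-y)\log(1-\hat{y})
    \]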

Experimental Setup

Input Data Splitting

The dataset was randomly split into five stratified partitions of similar size. Stratification was performed by the target class in order to maintain the class distribution of the original data in each partition. Four partitions were used for 4-fold cross-validation, and the fifth was reserved for testing. In each fold of cross-validation, three of the four partitions were used for training and the remaining one for model evaluation. Therefore, 4-fold cross-validation was used for training, evaluation, hyperparameter tuning and comparison of models built from different network architectures and, in the end, for choosing the best model. Cross-validation was preferred because it yields lower variance than the holdout method for a dataset of this size [43].
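As a minimal sketch of this split (the caret package and the seed are assumptions; the authors' exact splitting code is not given):

    # Stratified split into five partitions of similar size; createFolds
    # samples within the levels of y, preserving the class distribution.
    library(caret)

    set.seed(42)                           # illustrative seed
    partitions <- createFolds(y, k = 5)    # y: factor of benign/malignant labels
    test_idx   <- partitions[[5]]          # fifth partition reserved for testing
    cv_parts   <- partitions[1:4]          # four partitions for cross-validation

    # fold 1 of cross-validation: partition 1 validates, partitions 2-4 train
    val_idx   <- cv_parts[[1]]
    train_idx <- unlist(cv_parts[2:4])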

Since tuning a model is a repetitive process, some information leaks from the validation partition into the model, even though the model is not directly trained on it, resulting in overfitting to the validation set and optimistic performance metrics [44]. To obtain unbiased estimates of model performance, the test partition was used only once, to evaluate the best model selected among all those trained during the cross-validation phase.

The input data for the network were subjected to fold-specific min–max normalisation to the range [0, 1]. The validation and test sets were normalised with the values from the training set of the respective fold. Data were randomly shuffled at every epoch during training.
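A sketch of the fold-specific normalisation (variable names are illustrative): the statistics come from the training partition only and are reused for the validation and test data, avoiding leakage.

    normalise_fold <- function(train, ...) {
      lo <- min(train); hi <- max(train)            # training-set statistics only
      scale01 <- function(x) (x - lo) / (hi - lo)   # min-max scaling to [0, 1]
      lapply(c(list(train), list(...)), scale01)
    }

    # norm <- normalise_fold(x_train, x_val, x_test)
    # x_train <- norm[[1]]; x_val <- norm[[2]]; x_test <- norm[[3]]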

Figures 1 and 2 show the middle axial slice of each PET volume composing the cross-validation dataset, grouped by target class. The test set is not shown, to avoid information leakage during the construction of the models.

Fig. 1

Cross-validation dataset. Middle axial slice of each PET volume. Malignant pulmonary nodules

Fig. 2

Cross-validation dataset. Middle axial slice of each PET volume. Benign pulmonary nodules

Data Augmentation

Offline data augmentation [45] was performed independently in each of the four partitions of the original dataset previously created for cross-validation. Translations, rotations and Gaussian noise injection were separately applied to the original images; the test set was not augmented. The augmentation factor for each type of operation was class-specific in order to perform class balancing (Table 1). The augmented dataset comprises original and augmented images, around 4900 in total; its size was determined by the computational resources available for training models on a larger dataset. During cross-validation, the models were trained on the augmented training set of each fold, and evaluation was performed on the corresponding validation set of the original dataset.

Table 1 Data augmentation factor

Translations were random shifts between −10 and 10 pixels along any of the three axes of each original image. A maximum amplitude of 10 pixels (15 mm) was chosen to ensure that the nodules were not moved out of the tensor and the label was preserved.

Random rotations between −45 and 45° were separately applied around the x-, y- or z-axis of each image, so that each image yields augmented examples with different rotation axes, but each augmented example has a rotation around a single axis. Since the rotations were around an axis running through the centre of the tensor, a rotation was actually a composite operation (P_rotated = T⁻¹ × R × T × P), where P is the voxel, T a translation operation and R a rotation operation [46]. After the spatial transformation of the voxel coordinates, intensity interpolation with a bilinear interpolator was applied.

Gaussian noise with zero mean and three different standard deviations (0.1, 0.3 and 0.5) was added to the original images. Gaussian noise was chosen because it can model PET image noise [47], so the different augmented images simulate PET images with different noise levels.

The background voxels were filled with zero in all the above operations.
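Two of the three operations can be sketched as follows (rotation, a composite spatial transform with interpolation, is omitted for brevity; the functions are illustrative, not the study's code):

    # Random shift of up to 10 voxels along each axis; vacated background
    # voxels are zero-filled, as described above.
    translate_volume <- function(vol, max_shift = 10L) {
      out <- array(0, dim = dim(vol))
      s   <- sample(-max_shift:max_shift, 3, replace = TRUE)
      src <- lapply(1:3, function(a) pmax(1L, 1L - s[a]):pmin(dim(vol)[a], dim(vol)[a] - s[a]))
      dst <- lapply(1:3, function(a) src[[a]] + s[a])
      out[dst[[1]], dst[[2]], dst[[3]]] <- vol[src[[1]], src[[2]], src[[3]]]
      out
    }

    # Additive Gaussian noise with one of the three standard deviations used
    add_gaussian_noise <- function(vol, sd = 0.1) {
      vol + array(rnorm(length(vol), mean = 0, sd = sd), dim = dim(vol))
    }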

Training Procedures

The experiment was run in the R language [48]. The R interfaces for TensorFlow (v. 2.2.0) [49] and Keras (v. 2.3.0.0) [50] and the r-reticulate package [51], along with TensorFlow 2.1.0 [52] and Python 3.7.8 [53], were employed. The graphics card was an NVIDIA GeForce MX150.

Models were trained with the binary cross-entropy loss [41] and the Adam optimiser [54] throughout the entire experiment. The learning rate was tuned until an optimal value was reached; the learning rates of the different models are shown in Table 2.

Table 2 Models trained by cross-validation

The stopping criterion was the minimum validation loss with a patience of ten epochs, or a maximum of 100 epochs. The model from the training epoch with the lowest validation loss was saved. This procedure was repeated for each fold of the 4-fold cross-validation, resulting in four model versions with different parameter values but identical hyperparameter configuration. Early stopping ensures that the minimisation of the structural risk does not proceed beyond the point of best generalisation, obtaining a regularising effect [55].

Models were trained on the original dataset with full-batch learning or with mini-batch learning with a batch size of 16, according to the type of network. Mini-batch learning with a batch size of eight was preferred with augmented data.
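The training configuration described above corresponds to the following sketch in the R interface to Keras (the model and data tensors are assumed to exist; the learning rate shown is illustrative, as it was tuned per model, Table 2):

    library(keras)

    model %>% compile(
      loss      = "binary_crossentropy",
      optimizer = optimizer_adam(lr = 1e-4),   # learning rate tuned per model
      metrics   = "AUC"
    )

    history <- model %>% fit(
      x_train, y_train,
      epochs          = 100,                   # hard cap of 100 epochs
      batch_size      = 8,                     # mini-batches of eight with augmented data
      validation_data = list(x_val, y_val),
      callbacks = list(
        # stop at the minimum validation loss with a patience of ten epochs
        callback_early_stopping(monitor = "val_loss", patience = 10),
        # keep the weights from the epoch with the lowest validation loss
        callback_model_checkpoint("best_fold.h5", monitor = "val_loss",
                                  save_best_only = TRUE)
      )
    )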

Other specifications of the training procedures varied with the network architecture, or even between networks of the same architecture (i.e. they were treated as hyperparameters), and are explained in more detail in the next section.

Network Architecture

Three types of 3D CNN architectures with volumetric inputs were defined. These networks were generalised from homologous 2D CNNs (AlexNet [56], VGGNet [57] and Inception-v2 [58]), and their size was adapted to the complexity of the problem and the size of the dataset. As such, the number and arrangement of layers, number of filters, kernel size and other network specifications were treated as hyperparameters, tuned until the proposed models were found. These networks were trained on either the original or the augmented datasets. Additionally, a pre-trained 2D model was fine-tuned on the original dataset. Details of the different network architectures are given in Table 2.

Leaky ReLU (with α = 0.3) was the preferred activation function in the 3D CNNs because it allows a small, non-zero gradient when a unit is not active, thus preventing the 'dying ReLU' problem [59, 60].

Weights were randomly initialised according to the scheme proposed by He et al. [61], which was specifically developed for rectifier nonlinearities.

A network architecture inspired by AlexNet [56] was proposed. Named Stacked 3D CNN, it consists of four 3D convolutional layers and three 3D max-pooling layers alternately stacked, connected to three fully connected layers (32, 16 and 1 units, respectively). The first convolutional layer has eight filters, and the network width increases along the convolutional base by doubling the number of filters at each convolutional layer. The kernel size was 3 × 3 × 3 with a stride of one in the convolutional layers, and no padding was applied. Pooling layers consist of max-pooling operations with a kernel size of 2 × 2 × 2, a stride of two and no padding. The first fully connected layer receives the output of the last convolutional layer after it is flattened into a vector (Fig. 3).

Fig. 3

Architecture of the Stacked 3D CNN network (final model)
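The description above maps onto the following sketch in the R interface to Keras (reconstructed from the text; the L2 regularisation factor is illustrative, as it was treated as a hyperparameter):

    library(keras)

    leaky <- function(x) layer_activation_leaky_relu(x, alpha = 0.3)
    l2    <- regularizer_l2(1e-4)    # illustrative L2 factor

    model <- keras_model_sequential() %>%
      # four 3x3x3 conv layers (8, 16, 32, 64 filters), stride 1, no padding,
      # alternating with three 2x2x2 max-pooling layers (stride 2)
      layer_conv_3d(8, c(3, 3, 3), input_shape = c(40, 40, 40, 1),
                    kernel_initializer = "he_normal", kernel_regularizer = l2) %>% leaky() %>%
      layer_max_pooling_3d(pool_size = c(2, 2, 2)) %>%
      layer_conv_3d(16, c(3, 3, 3), kernel_initializer = "he_normal",
                    kernel_regularizer = l2) %>% leaky() %>%
      layer_max_pooling_3d(pool_size = c(2, 2, 2)) %>%
      layer_conv_3d(32, c(3, 3, 3), kernel_initializer = "he_normal",
                    kernel_regularizer = l2) %>% leaky() %>%
      layer_max_pooling_3d(pool_size = c(2, 2, 2)) %>%
      layer_conv_3d(64, c(3, 3, 3), kernel_initializer = "he_normal",
                    kernel_regularizer = l2) %>% leaky() %>%
      # flatten and pass through the fully connected head (32, 16, 1 units)
      layer_flatten() %>%
      layer_dense(32, kernel_initializer = "he_normal", kernel_regularizer = l2) %>% leaky() %>%
      layer_dense(16, kernel_initializer = "he_normal", kernel_regularizer = l2) %>% leaky() %>%
      layer_dense(1, activation = "sigmoid")

With 40 × 40 × 40 inputs and no padding, the spatial size shrinks to 1 × 1 × 1 after the fourth convolutional layer, so the flattened vector has 64 elements.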

The VGG-like network has ten layers in total, comprising multiple stacked convolutional layers, some followed by a pooling layer. The output of the last convolutional layer is flattened before a fully connected output layer (Figure A1 of Supplementary Information). The efficient use of 3 × 3 × 3 convolutions is a prominent property of this type of network [57]: convolutions with larger kernels are factorised into 3 × 3 × 3 convolutions while the receptive field is preserved. Consequently, the depth of the network increases while the number of parameters is reduced. More specifically, 5 × 5 × 5 and 7 × 7 × 7 convolutional layers are replaced by sets of two or three 3 × 3 × 3 convolutional layers. Factorisation of convolutions imposes a greater reduction in the number of parameters in a 3D network than in a 2D network, and therefore a greater regularising effect (Section A.1.1 of Supplementary Information). Expansion of the feature maps and reduction of the spatial resolution occur only after each pooling layer. Overlapping max-pooling [56] with a pool size of 3 × 3 × 3, a stride of two and padding was applied.

The Inception-v2-like network is a 3D CNN with three main characteristics—inception modules, 1 × 1 × 1 convolutions and factorisation of convolutions—which introduce sparsity into the network and reduce the number of parameters, making it more efficient [58]. Inception modules are blocks of several convolutional layers with different kernel sizes and a pooling layer that receive the same input, propagate the information in parallel and concatenate the outputs before passing them to the next layer [58]. Much of the computational efficiency comes from using 1 × 1 × 1 convolutions to reduce the number of feature maps before expensive convolutions with larger kernels. Factorising convolutions with larger kernels into stacks of 3 × 3 × 3 or asymmetrical convolutions, while preserving the receptive field, provides a further gain in efficiency. There are two types of inception modules—a standard module for learning representations and another that simultaneously downsizes the feature maps [58]. Two versions of each were implemented. The proposed network has four standard inception modules and two reduction modules. The output of the last convolutional layer is converted into a vector by global average pooling and passed to a fully connected output layer (Figures A2 to A6 of Supplementary Information).

Transfer learning [62] from the ImageNet dataset [63] was also performed. A pre-trained ResNet-50 [64] was used as a feature extractor, with two fully connected layers added on top and initialised according to He et al. [61]. Because the dataset of the current problem is quite different from the source domain, only the earlier layers of ResNet-50 were used (up to conv3_block1_out). Additionally, a few of the top layers (from conv3_block1_1_conv) were fine-tuned with a very low learning rate. Due to the 2D architecture, the input for this network consists of the three central slices (19, 20 and 21) of the PET volume, orthogonal to the third axis, each stored in a different channel.
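A sketch of this setup in the R interface to Keras follows (the truncation point and fine-tuned layers are taken from the text; the pooling choice, dense layer sizes and dropout rate are assumptions):

    library(keras)

    resnet <- application_resnet50(weights = "imagenet", include_top = FALSE,
                                   input_shape = c(40, 40, 3))

    # keep only the earlier layers, up to conv3_block1_out
    base <- keras_model(resnet$input,
                        get_layer(resnet, "conv3_block1_out")$output)

    # freeze the feature extractor, then fine-tune from conv3_block1_1_conv up
    freeze_weights(base)
    unfreeze_weights(base, from = "conv3_block1_1_conv")

    input  <- layer_input(shape = c(40, 40, 3))   # three central PET slices as channels
    output <- input %>%
      base() %>%
      layer_global_average_pooling_2d() %>%
      layer_dense(16, activation = "relu", kernel_initializer = "he_normal") %>%
      layer_dropout(rate = 0.5) %>%
      layer_dense(1, activation = "sigmoid")

    model <- keras_model(input, output)

    model %>% compile(loss = "binary_crossentropy",
                      optimizer = optimizer_adam(lr = 1e-5),   # very low learning rate
                      metrics = "AUC")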

Besides the regularisation procedures already described, L2 regularisation [42] was applied to all parameterised layers of the 3D CNN models, whereas dropout [65] was applied to the fully connected layers of the 2D CNN model.

Performance Metrics and Model Selection

The performance metric selected to evaluate the models was the area under the receiver operating characteristic (ROC) curve, computed with the trapezoidal rule from non-parametric ROC curves [66]. During cross-validation, model evaluation was conducted on the validation partition of the respective fold, and different models were compared by their mean area under the ROC curve across the four folds.

Models with different network architectures were trained. To deal with a source of non-determinism of TensorFlow on GPU, the best model of each network architecture was retrained and re-evaluated over 10 iterations, under identical conditions, on the 4-fold cross-validation. The average performance metrics over the 10 iterations were compared across models, and the best model was selected. Since that model has 10 versions for each fold, one of them was randomly picked.

Subsequently, an ensemble classifier was built from the four versions of the best model derived from the 4-fold cross-validation by averaging their output probabilities, weighted by the size of each training partition. This ensemble classifier was evaluated on the test set to determine its generalisation performance on unseen examples. The 95% confidence interval of the area under the ROC curve on the test set was determined according to the method described by DeLong [67].
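A sketch of the ensemble prediction and its test-set evaluation (the pROC package is assumed for the non-parametric AUC and the DeLong confidence interval; object names are illustrative):

    library(pROC)

    # weighted average of the four fold models' output probabilities,
    # with weights proportional to the size of each training partition
    ensemble_predict <- function(models, x, weights) {
      probs <- sapply(models, function(m) as.numeric(predict(m, x)))   # n x 4 matrix
      as.numeric(probs %*% (weights / sum(weights)))
    }

    p_test  <- ensemble_predict(fold_models, x_test, train_sizes)
    roc_cnn <- roc(y_test, p_test)        # empirical (non-parametric) ROC curve
    auc(roc_cnn)                          # trapezoidal area under the curve
    ci.auc(roc_cnn, method = "delong")    # 95% CI by DeLong's method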

Accuracy, sensitivity and specificity were complementary metrics determined on the test set. Instead of the standard decision threshold of 0.5, an optimal decision threshold was determined for each version of the best model on the respective validation partition. The four decision thresholds were averaged, and the resulting threshold was applied to convert the output probabilities of the ensemble model into classes on the test set: if the predicted probability was equal to or higher than the threshold, the nodule was classified as malignant; otherwise, it was classified as benign. The optimal threshold was determined according to two approaches. In the first, the optimal threshold was the posterior probability that maximises the Youden index [68]. In the second, the cost of a false negative was considered higher than the cost of a false positive; therefore, a minimum sensitivity of 95% was set, and the cut-off point that maximises the specificity was searched for.
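The two threshold rules can be sketched with pROC (version ≥ 1.16 assumed, where coords() returns a data frame; computed per fold on the validation partition):

    library(pROC)

    r  <- roc(y_val, p_val)    # per-fold validation ROC curve
    cc <- as.data.frame(coords(r, x = "all",
            ret = c("threshold", "sensitivity", "specificity")))

    # 1) threshold maximising the Youden index (sensitivity + specificity - 1)
    thr_youden <- cc$threshold[which.max(cc$sensitivity + cc$specificity - 1)]

    # 2) threshold maximising specificity subject to sensitivity >= 0.95
    ok <- cc[cc$sensitivity >= 0.95, ]
    thr_minsens <- ok$threshold[which.max(ok$specificity)]

    # the four fold-specific thresholds were then averaged and applied to the
    # ensemble probabilities on the test set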

Paired Comparison Between the Final CNN Model and the SUVmax

As a secondary analysis, a hypothesis test was performed on the test set to infer a possible difference in the area under the ROC curve between the final CNN model and the SUVmax in the population (H1: AUC_CNN ≠ AUC_SUVmax), starting from the null hypothesis of equality. The type I error (α) was predefined as 0.05.

The non-parametric test of DeLong et al. [67], which performs a paired comparison of the areas under two ROC curves, is applied if one ROC curve lies uniformly above the other across all operating points, that is, if the curves do not cross; otherwise, the hypothesis test based on ROC shape proposed by Venkatraman and Begg [69] is applied.
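Both tests are available in the pROC package, which allows the comparison to be sketched as follows (variable names are illustrative):

    library(pROC)

    roc_cnn <- roc(y_test, p_test)       # ensemble output probabilities
    roc_suv <- roc(y_test, suv_test)     # SUVmax of the same nodules

    # paired DeLong test, appropriate when the ROC curves do not cross
    roc.test(roc_cnn, roc_suv, method = "delong", paired = TRUE)

    # Venkatraman-Begg permutation test on the ROC shape, used when they cross
    roc.test(roc_cnn, roc_suv, method = "venkatraman", paired = TRUE, boot.n = 2000)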

Model Explainability

Grad-CAM analysis [70] was applied to generate visual explanations for the decisions of the model. This method highlights the regions of a volume-of-interest that are most class-discriminant from the standpoint of the 3D CNN classification model, providing insights into how the model succeeded or failed. The Grad-CAM 3D heatmap was obtained for each PET volume of the test set from each of the four 3D CNN model versions composing the ensemble model. Fusion images were created for selected cases by superimposing the axial slices of the Grad-CAM 3D heatmap on the axial slices of the original input PET volume. The volumes were reprocessed to ten axial slices rather than forty to facilitate the representation of the images. Red and dark red tones represent a higher Grad-CAM score for a class and thus mark the regions of the input volume most relevant to the model decision.
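A minimal Grad-CAM sketch for a 3D CNN in the R interface to TensorFlow is shown below (the name of the last convolutional layer is an assumption; the input volume is assumed to be shaped (1, 40, 40, 40, 1)):

    library(keras)
    library(tensorflow)

    grad_cam_3d <- function(model, volume, conv_layer = "conv3d_4") {
      # model that outputs both the last conv feature maps and the prediction
      grad_model <- keras_model(model$input,
                                list(get_layer(model, conv_layer)$output, model$output))
      with(tf$GradientTape() %as% tape, {
        out   <- grad_model(tf$cast(volume, "float32"))
        score <- out[[2]][, 1]                  # predicted probability of malignancy
      })
      grads <- tape$gradient(score, out[[1]])   # gradients w.r.t. the feature maps
      # channel weights: gradients averaged over the three spatial axes
      w   <- tf$reduce_mean(grads, axis = c(1L, 2L, 3L), keepdims = TRUE)
      cam <- tf$nn$relu(tf$reduce_sum(w * out[[1]], axis = -1L))   # weighted sum + ReLU
      as.array(cam / (tf$reduce_max(cam) + 1e-8))                  # normalised 3D heatmap
    }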

Results

Descriptive Statistics

The dataset has 113 participants, of whom 76 (67.3%) were male. The median age was 65 years (interquartile range (IQR): 14 years). One nodule was included per participant. The median diameter of the nodules on low-dose CT was 13 mm (IQR: 5 mm). Fifty-one (45.1%) pulmonary nodules were malignant; the remaining were benign. Table 3 shows the distribution of the nodules according to type, detailing the histological type of the malignant nodules.

Table 3 Characterisation of the pulmonary nodules according to the histological type

The reference standard was obtained by histological examination, cytological examination or follow-up CT in 71 (62.8%), 1 (0.9%) and 41 (36.3%) of the nodules, respectively. When the reference standard was obtained by follow-up CT, the median follow-up was 2.6 years (minimum: 1.3 years; maximum: 8.3 years), and 85.4% of the participants had a follow-up time ≥2 years.

Evaluation of CNN Models by 4-Fold Cross-Validation

Table 4 shows the area under the ROC curve for the CNN models evaluated by 4-fold cross-validation. The classification performance measured by this metric ranged from 0.7738 for a pre-trained ResNet-50 model to 0.8864 for a Stacked 3D CNN model. Regardless of the type of model, models trained on the original dataset consistently performed better than those trained on the augmented dataset.

Table 4 Evaluation of the CNN models by 4-fold cross-validation

Retraining and evaluation over 10 iterations of 4-fold cross-validation resulted in a mean area under the ROC curve of 0.8822, 0.8760 and 0.8690 for the best models of each architecture (Stacked 3D CNN, VGG-like and Inception-v2-like, respectively), all trained on the original dataset (Tables A1 to A3 of Supplementary Information). ResNet-50 was not retrained because its performance was much lower than that of the other architectures. The Stacked 3D CNN model consistently showed the best performance in the iterated cross-validation.

Evaluation of the Final CNN Model in the Test Set

The final model (Stacked 3D CNN model) obtained an area under the ROC curve of 0.8385 (95% CI: 0.6455–1.0000) on the test set (Fig. 4).

Fig. 4

Comparison of the ROC curves of the final CNN model and the SUVmax on the test set

For the decision threshold (0.5039) maximising the Youden index in cross-validation, the model obtained a sensitivity of 40.0%, a specificity of 100.0% and an accuracy of 73.9% for classifying pulmonary nodules on the test set. For the decision threshold (0.3149) ensuring a minimum sensitivity of 95% in cross-validation while maximising specificity, the model had a sensitivity of 80.0%, a specificity of 69.2% and an accuracy of 73.9% on the test set.

Comparison Between the Final CNN Model and the SUVmax

Figure 4 shows a comparison of the ROC curves of the SUVmax and the final CNN model. Since the ROC curves cross each other at several points, a paired comparison with the Venkatraman and Begg test [69], which evaluates the equivalence of the curve shapes rather than the areas under the curves, was applied. The test statistic (E) was 22, and the two-sided P-value was 0.7995, based on 2000 permutations.

Grad-CAM Analysis

Visual analysis of the Grad-CAM 3D heatmaps generated for the 3D CNN model versions (F1 to F4) composing the ensemble model was performed for all examples of the test set. Representative cases were selected to illustrate how the ensemble model succeeds or fails. Figures 5 and 6 show the Grad-CAM analysis of cases correctly classified by the model (a true positive and a true negative, respectively). Figure 7 shows the analysis of a false positive case, whereas Fig. 8 shows a false negative case.

Fig. 5

Grad-CAM 3D heatmaps generated for an input PET volume from the test set, containing a pulmonary nodule which was correctly classified as malignant by the ensemble model (true positive). a Thickened axial slices from the original PET volume are shown. b–e Thickened axial slices obtained by superimposing the original PET image and the Grad-CAM 3D heatmap. Each 3D CNN model version of the ensemble model has its own Grad-CAM 3D heatmap (one heatmap per row is represented)

Fig. 6

Grad-CAM 3D heatmaps generated for an input PET volume from the test set, containing a pulmonary nodule which was correctly classified as benign by the ensemble model (true negative). a Thickened axial slices from the original PET volume are shown. b–e Thickened axial slices obtained by superimposing the original PET image and the Grad-CAM 3D heatmap. Each 3D CNN model version of the ensemble model has its own Grad-CAM 3D heatmap (one heatmap per row is represented)

Fig. 7

Grad-CAM 3D heatmaps generated for an input PET volume from the test set, containing a pulmonary nodule which was classified as malignant while it was benign according to the reference standard (false positive). a Thickened axial slices from the original PET volume are shown. b–e Thickened axial slices obtained by superimposing the original PET image and the Grad-CAM 3D heatmap. Each 3D CNN model version of the ensemble model has its own Grad-CAM 3D heatmap (one heatmap per row is represented)

Fig. 8

Grad-CAM 3D heatmaps generated for an input PET volume from the test set, containing a pulmonary nodule which was classified as benign while it was malignant according to the reference standard (false negative). a Thickened axial slices from the original PET volume are shown. b–e Thickened axial slices obtained by superimposing the original PET image and the Grad-CAM 3D heatmap. Each 3D CNN model version of the ensemble model has its own Grad-CAM 3D heatmap (one heatmap per row is represented)

In all cases, the four versions of the 3D CNN composing the ensemble model attend to quite similar regions of the PET volume. In the true positive cases, the most class-specific region includes the focal 2-[18F]FDG uptake in the nodule at the centre of the volume and a rim of background surrounding the nodule. In true negative cases, 2-[18F]FDG uptake tends to be absent in the nodule, so the model either attaches importance to regions close to the volume boundary (organs with physiological uptake in some cases) or to a region with non-nodular shape that includes the centre of the volume. In most false positive cases, the Grad-CAM heatmap has its highest score in an ellipsoid region at the centre of the volume and resembles that of the true positive cases; it coincides with 2-[18F]FDG uptake of variable intensity in the nodule. Similarly, the Grad-CAM heatmap of false negative cases resembles that of the true negative cases, whereas the nodule shows slight or absent 2-[18F]FDG uptake.

Discussion

We present a 3D CNN model for the classification of solid pulmonary nodules from an annotated dataset of PET images specifically created for that purpose. The classification task aimed to differentiate benign from malignant nodules. To the best of our knowledge, this is the first study to build a deep learning model for the classification of indeterminate solid pulmonary nodules using PET images as inputs.

Previous attempts at using machine learning models for the differential diagnosis of indeterminate pulmonary nodules have addressed classical methods and handcrafted features, namely radiomic features extracted from PET images [9, 11,12,13,14,15,16,17,18,19,20]. Despite some claims of superiority of radiomic models over visual interpretation or the SUVmax, the studies published to date have methodological issues that prevent a definitive conclusion about the added value of radiomics. A risk of data leakage, and consequent overfitting, was found in the studies of Palumbo B et al. [13], Albano B et al. [14], Ren C et al. [17] and Chen S et al. [11], owing to exploratory analysis/feature selection performed on the entire dataset or the absence of a disjoint test set. Additionally, Palumbo B et al. [13], Albano B et al. [14] and Salihoğlu YS et al. [16] compared the performance metric of the radiomics model with that of the basal model/conventional method without a statistical hypothesis test, which prevents any inference beyond the respective dataset. The studies of Ren C et al. [17], Chen S et al. [11] and Zhang J et al. [12] made multiple comparisons of different models without controlling the family-wise error rate. Zhang J et al. [12] and Ren C et al. [17] found a superior area under the ROC curve for the radiomic model relative to the SUVmax at the expense of an unnecessary and inappropriate binarisation of the latter with a pre-specified threshold, which likely underestimates the area under the ROC curve of the SUVmax.

Regarding deep learning, Yong Han et al. [71] trained several classical machine learning models and a pre-trained 2D CNN (VGG-16) to distinguish the histological subtype of pulmonary lesions in patients already diagnosed with lung cancer, from a dataset of 1419 PET/CT fusion images. The deep learning model obtained an area under the ROC curve of 0.903. Despite the use of a deep learning algorithm, the classification problem is not the same as in the current research, because it included only malignant lesions and also used the CT image data.

The final model of the current research yielded an area under the ROC curve of 0.8385 (95% CI: 0.6455–1.0000) on the test set. It has four 3D convolutional layers, three 3D max-pooling layers and three fully connected layers, a simpler and shallower architecture than the more recent types of networks published [72]. Since the inputs of the 3D CNN are volumes, it learns 3D spatial representations of the whole nodule, unlike a 2D CNN, which receives only some slices intersecting the nodule, leading to a loss of spatial information (at least in the simplest approaches) [73,74,75]. For this reason, a 3D CNN was preferred. However, a 3D CNN comes at the cost of a higher number of parameters and a higher risk of overfitting [73]. As such, the capacity of the model was carefully adjusted to the problem and to the size of the dataset, and several regularisation methods were applied, such as early stopping and L2 regularisation.

The probabilistic predictions were converted to target classes by determining an optimal threshold, using two approaches. Youden's index and a pre-assigned value of sensitivity both yielded an accuracy of 73.9% on the test set. However, the sensitivity obtained with the second method on the test set was much more favourable (80% vs. 40%). This is explained by the characteristics of each method and by the variance associated with the small size of the test set (23 images). The specificity of the second method of threshold moving was 69.2%. A threshold that maximises specificity subject to a minimum sensitivity of 95% (derived from cross-validation) may be the more appropriate approach for the current problem, because a greater cost is placed on false negatives than on false positives, on the assumption that the cost of missing a malignant lesion is higher than the cost of the additional investigations and psychological distress caused by a false positive.

As a secondary endpoint, the performance of the 3D CNN model was compared with the SUVmax of the nodules. The model had a higher area under the ROC curve than the SUVmax on the test set (0.8385 vs. 0.8038). However, the equivalence of the two ROC curves was not rejected by a hypothesis test comparing their shapes. Because the test set was not sized to ensure adequate statistical power for the applied test, this negative result requires confirmation in well-powered studies.

Other types of 3D CNNs were also proposed, achieving a slightly lower area under the ROC curve than the Stacked 3D CNN in cross-validation. These networks were inspired by VGG-16 [57] and Inception-v2 [58]. They are deeper and have features that make them more efficient, such as factorisation of convolutions, the introduction of sparsity in the network and 1 × 1 × 1 convolutions.

Deep learning models usually need to be trained on large datasets to prevent overfitting [76]. However, building an annotated dataset in medical imaging is a time-consuming and labour-intensive task. Furthermore, the particularity of the task and the imaging modality involved imply that the number of available images may be limited, as in the current research. Even so, a model was successfully trained and regularised.

Models trained with transfer learning had lower classification performance in cross-validation than models trained with random weight initialisation. This could be explained by the difference between the source domain where the CNN was pre-trained (ImageNet) and the target domain, by the 2D architecture requiring 2D inputs, or by the type of pre-trained network (ResNet). Models trained with data augmentation also had consistently lower classification performance in cross-validation than those trained on the original training dataset. Making statistical inferences from the differences between models was outside the scope of the model selection phase, so the meaning of those differences, as well as their cause, is unknown. It is hypothesised that the size of the original dataset was insufficient for the augmentations to produce any effect, that the type or parameters of the transformations were not the most appropriate for this specific type of image data and problem, or that the augmentation factor was insufficient.

High-quality and representative datasets are essential for developing machine learning models and for ensuring acceptable generalisation performance on unseen cases. Although this is a retrospective study, the eligible population was explicitly and accurately defined. The quality of a dataset also depends on the quality of the reference standard. Predictive modelling for diagnostic purposes follows the same principles as diagnostic tests regarding obtaining an unbiased reference standard [77]. Proof of the presence or absence of the target disease should be obtained without knowledge of the index test, and vice versa [77, 78]. Similarly, the reference standard should not contain information from the data on which a predictive model will be built; otherwise, the model will have optimistic performance [77]. In the current study, that incorporation bias was prevented by using the result of the histopathological or cytopathological examination of a specimen obtained by biopsy or surgical excision or, alternatively, a follow-up period with CT. There was, therefore, a differential verification of disease status.

The histopathological characterisation of the lesion was the main method of obtaining the reference standard, representing 62.8% of the cases, and CT imaging follow-up was the method of definitive diagnosis in the remaining cases (except one), with 85% of the patients having a follow-up time of at least 2 years. Surgical resection is the gold standard for the definitive diagnosis of pulmonary nodules [79] and is an unbiased reference standard. Biopsy also provides direct evidence of malignancy, but there is a risk of non-specific benign changes appearing as false negatives [80]. To eliminate that risk of bias in the biopsy, only definitive evidence of benign pathology was considered (on first or repeated biopsies); otherwise, the follow-up criterion was applied. Imaging follow-up provides indirect, but still strong, evidence of the status of the nodule, leading to a low risk of bias in the ground truth. The defined follow-up criteria ensured that a malignant tumour is missed in <1% of cases, according to the previous literature [8].

In malignant nodules, the Grad-CAM analysis showed that the model tends to attend to the nodule region during the decision, whereas in benign nodules, either no object in the lung receives particular attention or a central region with non-nodular shape does. Moreover, the size and shape of the most class-discriminant region seem to matter for the model decision, which raises the hypothesis that the decision relies on nodule-to-background contrast and on the metabolic shape of the nodule. Model failures are explained by the similarity between the Grad-CAM heatmap of a given image and those of the misidentified class.

This study has some limitations. The model was built on a relatively small dataset, and despite the regularisation efforts, its performance on a larger dataset is unknown. The test set was also small, so the generalisation performance estimate is highly dependent on the data split. It is also unknown how the model generalises on a per-scanner basis, including to images obtained from PET scanner types not represented in the current dataset.

Because this is a retrospective study, the decisions to perform a PET/CT exam or a biopsy or excision of the pulmonary nodule, as well as the duration of the follow-up period, were at the discretion of the attending physician. The decision criteria may have changed over time, as knowledge in this area evolved, and may have varied between attending physicians, resulting in selection and partial verification biases [81]. When multiple nodules were present, the dataset included only the most suspicious nodule from each patient instead of all nodules, although in practice it is important to know the status of all of them.

The image data store the standardised uptake value (SUV) per voxel. The SUV is widely used, but another, less common measure has been claimed to be more accurate: the standardised uptake value normalised by lean body mass (SUL) [82], since lean and fat tissues have different metabolic profiles. The image data were not recalculated to SUL because the DICOM files from one of the PET scanners did not record the height data.

As future work, we suggest evaluating the proposed model on a larger dataset, preferably collected prospectively from multiple centres and PET/CT scanners, and possibly retraining it on those data. Another proposal is to train a CNN model that considers not only the PET data but also the low-dose CT data from the same exam and non-imaging features.

The task in the current research required manual nodule localisation before automatic classification. However, nodule localisation and classification can be combined in a single machine learning task (nodule detection).

Conclusion

In this study, we developed a 3D CNN model for the automatic classification of indeterminate solid pulmonary nodules from an annotated dataset of 2-[18F]FDG PET images specifically created for that purpose. The model was effective in differentiating malignant from benign nodules and has the potential to improve the differential diagnosis of pulmonary nodules.