Multi-kernel driven 3D convolutional neural network for automated detection of lung nodules in chest CT scans

The accurate position detection of lung nodules is crucial in early chest computed tomography (CT)-based lung cancer screening, which helps to improve the survival rate of patients. Deep learning methodologies have shown impressive feature extraction ability in the CT image analysis task, but it is still a challenge to develop a robust nodule detection model due to the salient morphological heterogeneity of nodules and complex surrounding environment. In this study, a multi-kernel driven 3D convolutional neural network (MK-3DCNN) is proposed for computerized nodule detection in CT scans. In the MK-3DCNN, a residual learning-based encoder-decoder architecture is introduced to employ the multi-layer features of the deep model. Considering the various nodule sizes and shapes, a multi-kernel joint learning block is developed to capture 3D multi-scale spatial information of nodule CT images, and this is conducive to improving nodule detection performance. Furthermore, a multi-mode mixed pooling strategy is designed to replace the conventional single-mode pooling manner, and it reasonably integrates the max pooling, average pooling, and center cropping pooling operations to obtain more comprehensive nodule descriptions from complicated CT images. Experimental results on the public dataset LUNA16 illustrate that the proposed MK-3DCNN method achieves more competitive nodule detection performance compared to some state-of-the-art algorithms. The results on our constructed clinical dataset CQUCH-LND indicate that the MK-3DCNN has a good prospect in clinical practice.


Introduction
Lung cancer is one of the malignant tumors that pose the greatest menace to human life and health. As reported in a statistical analysis of the global cancer burden, lung cancer accounts for nearly 11.4% and 18% of 19.3 million new cancer cases and 10.0 million cancer deaths, respectively [1]. The five-year survival of early-stage lung cancers is significantly higher than that of advanced lung cancers, thus early detection and timely treatment are effective solutions for lung cancer [2,3]. Chest computed tomography (CT) is a common means of noninvasive lung cancer screening, and it contributes to decreasing the mortality of high-risk individuals [4,5]. Lung nodules are the principal clinical manifestation of early lung cancer, and CT-based nodule location detection is an indispensable procedure in lung cancer screening [6][7][8][9].
In clinical diagnosis, to conduct a thorough examination, radiologists are usually required to read dozens or even hundreds of CT slices for each patient in a slice-by-slice manner, and such work is labor-intensive and prone to operator bias [10][11][12]. As a result, computerized lung nodule position detection is an active research topic in the medical imaging analysis field, which aims at assisting clinicians to improve diagnostic efficiency [13][14][15][16][17][18]. As illustrated in Fig. 1, lung nodules vary greatly in scale, appearance and intensity, and they may occur anywhere in the lungs and are often surrounded by complex background tissues. Therefore, it is crucial to extract 3D multi-scale discriminative features for achieving accurate nodule detection. In recent years, many methodologies have been designed for automated lung nodule location, and they can be grouped into two main categories: traditional detection methods and deep learning-based approaches [19][20][21]. In the former, threshold-based algorithms, morphological operations, energy optimization approaches and clustering methods are generally used to detect nodules from CT images [22]. For example, Rezaie et al. [23] first exploited image segmentation and thresholding to select the regions of interest that may contain nodule objects, and then an edge detection method was used for nodule location. El-Regaily et al. [24] presented a hybrid algorithm to detect nodules that incorporates several traditional approaches, including a thresholding algorithm, region growing and morphological operations. Generally, once candidate nodules are located, hand-crafted features such as shape, intensity, context and texture features are further designed for false positive reduction [25][26][27]. However, the traditional nodule detection methods depend heavily on domain experience to extract features, and they can describe only single or partial nodule characteristics.
Deep learning possesses excellent feature extraction capability, and it has shown outstanding performance in the computer vision field [28][29][30][31][32]. The convolutional neural network (CNN) is one of the most influential deep learning paradigms, and a variety of CNN-based models have been developed for medical image analysis [33][34][35][36][37]. In the nodule detection task, 2D CNN-based approaches have been widely employed [38][39][40]. Xie et al. [41] constructed a 2D CNN-based nodule detection architecture composed of three sub-models with the same network structure, in which each sub-model consists of a feature extraction module, a deconvolutional layer and two region proposal modules, and the results produced by the three sub-models are merged to acquire nodule candidates. Nguyen et al. [42] employed a Faster Region-based 2D CNN structure as the backbone, and an adaptive anchor box size generating strategy was proposed to improve the nodule detection sensitivity. Ramachandran et al. [43] introduced the You Only Look Once (YOLO) architecture, a frequently employed model for object detection in 2D natural images, into the lung nodule detection task, and they used a large-sized input to get good detection results. Although improved detection performance has been attained, it is difficult for 2D CNN models to learn the 3D spatial structure information of CT images, which may limit further improvement of detection accuracy.
Recently, developing 3D CNN-based learning models has become the mainstream research direction for analysis tasks on medical images with inherent 3D attributes [44][45][46][47][48][49]. Wang et al. [50] exploited 3D convolutional networks to simultaneously learn spatial and spectral features of medical hyperspectral images, and a specific loss function was designed to improve network performance. Guan et al. [51] proposed an attention mechanism-based 3D model to automatically analyze brain magnetic resonance imaging data; the model introduced a squeeze-excite module and an attention guide filter module to enhance feature learning ability. Dou et al. [52] designed a two-stage framework with 3D convolutional operations for lung nodule detection: they first established a fully convolutional network based on a training strategy with online sample filtering for candidate screening, and then developed a hybrid-loss residual model to identify nodule objects from the candidates. Liao et al. [53] developed a 3D UNet-like structure to achieve automatic nodule location, and a set of 3D residual learning blocks was designed to extract 3D representation information of lung nodules. Zhu et al. [54] developed a nodule detection framework called DeepLung, which benefits from the advantages of both dense connection and residual learning by using a 3D dual-path network structure. Mei et al. [55] designed a slice-aware network in which a slice grouped non-local structure is introduced for learning long-distance relationships in the feature map, and a new dataset was collected for estimating the nodule detection performance of the network. Lin et al. [56] presented a 3D nodule detection architecture named Inception Residual UNet++ (IR-UNet++); the IR-UNet++ model combined the ResNet and Inception as the building block, and embedded a squeeze-and-excitation architecture into the building block for better feature learning. Zhu et al. [57] employed a U-shaped 3D residual structure to achieve computer-aided lung nodule detection, and an improved attention gate and a channel interaction unit were designed to improve detection sensitivity. Jian et al. [58] developed a 3D convolutional model termed 3D Deep Attention and Global Search Network (3DAGNet) for lung nodule detection; the 3DAGNet included a global and channel module to strengthen the global and spatial information learning capability, and a multi-layer module to capture multi-level features. Xu et al. [59] proposed a slice grouped domain attention (SGDA)-based nodule detection method to enhance the generalization performance; the SGDA module worked in the axial, sagittal, and coronal directions to explore the inter-dependencies of each group feature mapping in different directions. However, the above-mentioned methods are based on CNNs with a fixed kernel size and a single-mode pooling operation, which limits their ability to describe complex lung nodule CT images with variable lesion size, appearance and density.
To address the aforementioned limitations, a multi-kernel driven 3D convolutional neural network (MK-3DCNN) is proposed to improve nodule detection accuracy. Unlike conventional two-stage detection frameworks (including nodule candidate detection and false positive reduction), our approach abandons the false-positive reduction procedure and trains an end-to-end model to achieve automated nodule detection in a one-stage learning paradigm. The main contributions of this work are summarized as follows.
(1) To overcome the limitation that traditional convolutional networks with a single receptive field struggle to cope with the variable imaging morphology of lung nodules, a multi-kernel joint learning algorithm is proposed to fully explore the 3D multi-scale discriminative information of lung nodule CT images, which contributes to improving nodule detection performance.
(2) A multi-mode mixed pooling strategy integrating max pooling, average pooling, and center cropping pooling is designed to replace the conventional single-mode pooling. These three types of pooling operations complement each other, so more comprehensive nodule descriptions can be obtained.
(3) To evaluate the effectiveness of our MK-3DCNN in clinical application, a new dataset CQUCH-LND annotated through biopsy-based cytological analysis is collected. Experimental results on the public dataset LUNA16 and the clinical dataset CQUCH-LND show that the MK-3DCNN outperforms some state-of-the-art nodule detection approaches and possesses a good generalization ability.
The rest of this article is arranged as follows. Section 2 describes the two employed datasets LUNA16 and CQUCH-LND. In Section 3, the proposed MK-3DCNN method is detailed. To evaluate the nodule detection performance of the MK-3DCNN, experimental results on the LUNA16 and the CQUCH-LND are provided and analyzed in Section 4. The advantages and disadvantages of the proposed method are discussed in Section 5. Section 6 concludes this paper.

Dataset description
In this study, the most commonly used public dataset LUNA16 [13] and the Chongqing University Cancer Hospital Lung Nodule Diagnosis (CQUCH-LND) dataset constructed from a grade-A tertiary cancer hospital are adopted for evaluating the lung nodule detection performance of the developed MK-3DCNN framework.

LUNA16 dataset
The LUNA16 is currently the largest public dataset for computerized lung nodule location in chest CT images, and it is constructed from the well-known publicly available database LIDC-IDRI collected by 7 academic institutions and 8 medical imaging corporations [13,60]. After removing the cases with missing slices, inconsistent pixel spacing or scanning thickness > 3 mm from the LIDC-IDRI, the LUNA16 has a total of 888 scans, and these scans are provided in the MHD/RAW format.
In the LUNA16, the number of slices per scan varies from 95 to 764, the scanning thickness ranges from 0.45 to 2.5 mm, the pixel spacing ranges from 0.46 to 0.98 mm, and all slices have the same size of 512 × 512 pixels. Moreover, for the LUNA16, only the nodules ≥ 3 mm that are annotated by at least 3 out of 4 radiologists are considered as positive samples. Under this sampling rule, 1186 lung nodule examples are collected in the dataset. The LUNA16 gives rich annotations including the center coordinate and scale information for each sampled lung nodule, as well as the lung segmentation images of all CT scans. In addition, the LUNA16 explicitly provides a patient-level data split for 10-fold cross-validation, and more details concerning the LUNA16 can be found at https://luna16.grand-challenge.org/.

CQUCH-LND dataset
The CQUCH-LND is collected from Chongqing University Cancer Hospital, and its use in this work was approved by the review committee. This dataset includes 263 low-dose CT scans (DICOM format) that are acquired with the Philips Brilliance 64 spiral CT scanner, and all CT data have been anonymized to protect patient privacy. In the CQUCH-LND, the number of slices in each case ranges from 128 to 715, the scanning thickness ranges from 0.5 to 2.0 mm, the slice resolution ranges from 0.54 to 0.97 mm, and all slices have a fixed size of 512 × 512 pixels. In this dataset, a total of 263 lung nodules are labeled according to biopsy-based cytological analysis, and the lung nodule scale ranges from 3 to 30 mm, which is similar to that of the LUNA16 dataset.

Overview
Lung nodules differ greatly in size, appearance and density, and exploring 3D multi-scale discriminative representations is an effective approach to boost detection performance. Given this fact, a multi-kernel driven 3D convolutional neural network (MK-3DCNN) is proposed to fulfill automated lung nodule detection, and the general structure of the MK-3DCNN model is displayed in Fig. 2. As exhibited in Fig. 2, the MK-3DCNN framework uses a UNet-like encoder-decoder structure as the backbone network to utilize the multi-layer features of the deep model, and introduces a region proposal network (RPN) [61] as the output module to generate high-quality proposals. In the encoder part of the MK-3DCNN, a multi-kernel joint learning model is developed to capture multi-scale lung nodule information. Furthermore, a residual learning module combined with a multi-mode mixed pooling (M³P) operation is designed to learn more comprehensive descriptions of nodule CT images, which can relieve the problem of information loss caused by the traditional single-mode pooling. In addition, the decoder part mainly involves three components: the deconvolution layer, the residual learning unit, and the concatenation operation. In the following sections, the above contents will be detailed. For convenience, Table 1 lists the mathematical symbols used in this paper.

Image preprocessing and input patch cropping
Thoracic CT scans in a dataset are generally acquired from diverse scanners and patients, which unavoidably leads to inconsistencies in intensity distribution and spatial resolution (slice thickness and pixel pitch). As a result, standardizing original CT data is essential for data-driven models to achieve satisfactory detection performance. Moreover, because lung nodules only occur in the lung region, it is important to remove the background area to reduce computation overhead and improve detection accuracy. Figure 3 illustrates the main processes of our designed CT image preprocessing approach.
As shown in Fig. 3, we first normalize the image intensity of all original CT data to a unified distribution. Specifically, a window level $[\zeta_{min}, \zeta_{max}]$ is employed to clip the intensity values of CT images, and the clipped values are then mapped to $[0, \zeta_{nor}]$. Then, the lung region masks are used to eliminate background areas, and this procedure mainly includes the following steps. (a) Mask images are binarized by setting a threshold. (b) A convex hull computation is performed so that lung nodules attached to the lung walls are included. (c) The lung region is extracted by multiplying the mask with the intensity-normalized image, and the background area outside the mask is filled with an intensity of common tissues. (d) The intensities of the high-luminance background areas (e.g., bone) are clipped to suppress interfering information. In the public dataset LUNA16, the lung segmentation annotation provided in this dataset is employed as the mask label. For the clinical dataset CQUCH-LND, a threshold-based method [53] is introduced to obtain the lung region mask.
[Table 1. Mathematical symbols used in this paper (entries recoverable from the extracted text): the max pooling; the average pooling; the center cropping pooling; $\psi$, channel-wise squeezing operation; $N_{anc}$, the number of anchors; $p_{gt}$, the label of the anchored object; $\tilde{p}_{cla}$, predicted probability; $\lambda_{add}$, summation weight.]
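The intensity normalization step can be illustrated with a minimal sketch. It assumes a linear mapping from the clipped window to $[0, \zeta_{nor}]$, and the default values used for `z_min`, `z_max` and `z_nor` are hypothetical placeholders, since the window settings are not stated in this passage.

```python
def normalize_intensity(values, z_min=-1200.0, z_max=600.0, z_nor=255.0):
    """Clip intensities to [z_min, z_max], then map them linearly to [0, z_nor].

    The defaults are illustrative assumptions, not the settings of the paper.
    """
    normalized = []
    for v in values:
        clipped = min(max(v, z_min), z_max)  # window clipping
        normalized.append((clipped - z_min) / (z_max - z_min) * z_nor)
    return normalized


# Values below the window map to 0 and values above it map to z_nor.
print(normalize_intensity([-2000, -1200, -300, 600, 3000]))
```

Clipping before rescaling keeps extreme intensities (air outside the body, metal, dense bone) from compressing the range available to lung tissue.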
Further, the cubic spline interpolation approach is exploited to resample each scan to a unified spatial resolution γ × γ × γ.
After the above steps, redundant background areas are removed to improve computational efficiency. In addition, considering limited GPU memory, small 3D patches with a voxel size $h \times w \times s$ (height × width × slice) are cropped from the pre-processed CT scans to be used as the inputs of the proposed model. Following the settings in [12,53,54], [...] Data augmentation is a practical strategy for avoiding the overfitting problem of data-driven methods in small-sample learning tasks. To increase the diversity and quantity of training samples, the extracted patches are randomly flipped and rotated. Furthermore, inspired by several previous works [53,54], we also use a cropping ratio between 0.75 and 1.25 to augment the dataset. In this augmentation operation, if a cropped patch is smaller than the set input size, the patch is padded with a constant to adjust it to the input size. Similarly, if a cropped patch is larger than the set input size, the excess part is removed. The above size adjustment process is only performed on one side of each dimension of the patch to change the center position of the patch.
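The one-sided size adjustment can be sketched along a single dimension as follows. The function name `fit_to_length`, the choice of the trailing side, and the padding constant 0 are assumptions made for illustration; the text only states that padding or trimming happens on one side of each dimension.

```python
def fit_to_length(values, target, pad_value=0):
    """Pad or trim a 1D sequence on its trailing side only.

    Adjusting one side (instead of both) shifts the center of the patch,
    which is the effect described for the cropping-ratio augmentation.
    """
    if len(values) < target:
        return list(values) + [pad_value] * (target - len(values))
    return list(values[:target])


print(fit_to_length([1, 2, 3], 5))           # padded on the trailing side
print(fit_to_length([1, 2, 3, 4, 5, 6], 4))  # trimmed on the trailing side
```

For a 3D patch, the same adjustment would be applied independently along height, width and slice.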

Multi-kernel joint learning block
In traditional lung nodule detection networks, the convolutional kernels in each layer are generally designed to share the same size, and this learning paradigm with fixed receptive fields struggles to capture the discriminative features of nodule CT images with variable lesion sizes. To address this restriction, a multi-kernel joint learning module (MKJLM) is developed to enhance the nodule detection ability. Furthermore, given that the size information of a lung nodule is gradually encoded into the high-level representations as the network depth increases, so that the scale of the convolution kernel matters less than in shallower layers [62], two MKJLM modules with the same network structure are stacked and then embedded in the MK-3DCNN framework for continuous multi-scale feature learning from input patches. Figure 4 shows the first embedded MKJLM module. As described in Fig. 4, to effectively capture the feature information of lung nodule CT images with various lesion scales and appearances, three 3D convolution operations $F^1_{conv}$, $F^2_{conv}$ and $F^3_{conv}$ are performed in parallel on the input $X_{in} \in \Re^{1 \times h \times w \times s}$, producing the branch features $F^1_{conv}(X_{in})$, $F^2_{conv}(X_{in})$ and $F^3_{conv}(X_{in})$. Then, an element-wise summation $\oplus$ is introduced for fusing the feature tensors generated by the multiple convolution branches, and the global average pooling $F_{gap}$ is added to obtain the global information $G_{inf} \in \Re^{C \times 1 \times 1 \times 1}$:
$$G_{inf} = F_{gap}\left(F^1_{conv}(X_{in}) \oplus F^2_{conv}(X_{in}) \oplus F^3_{conv}(X_{in})\right)$$
Further, the output feature map of this multiple convolutional kernel collaborative learning module, $F_{out} \in \Re^{C \times H \times W \times S}$, can be obtained by
$$F_{out} = \left(\tilde{q}_c \otimes F^1_{conv}(X_{in})\right) \oplus \left(\tilde{k}_c \otimes F^2_{conv}(X_{in})\right) \oplus \left(\tilde{v}_c \otimes F^3_{conv}(X_{in})\right)$$
where $\otimes$ represents element-wise multiplication. $\tilde{q}_c$, $\tilde{k}_c$ and $\tilde{v}_c$ refer to the Softmax operation for the elements among $q_c$, $k_c$ and $v_c$, and the motive of this stage is to adaptively aggregate the feature maps produced by different convolution operations in an end-to-end learnable fashion. Furthermore, $q_c$ could be computed as
$$q_c = F^2_{fc}\left(\varphi\left(F^1_{fc}(G_{inf})\right)\right) = W^2_{fc}\,\varphi\left(W^1_{fc}\,G_{inf}\right)$$
in which $\varphi$ denotes the activation function Rectified Linear Unit (ReLU), $W^1_{fc} \in \Re^{C/r \times C}$ and $W^2_{fc} \in \Re^{C \times C/r}$ are the weights of the fully-connected layers $F^1_{fc}$ and $F^2_{fc}$, and $r$ means the reduction ratio. The computational procedures of $k_c$ and $v_c$ are like that of $q_c$.
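The adaptive aggregation step of the MKJLM can be illustrated for a single channel with a minimal sketch. It assumes that the Softmax is taken across the three branch scores of that channel and that each normalized score weights the feature map of one branch; the function names and the flattened-list representation of a feature map are hypothetical.

```python
import math


def branch_weights(q_c, k_c, v_c):
    """Softmax across the three branch scores of one channel."""
    peak = max(q_c, k_c, v_c)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in (q_c, k_c, v_c)]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_channel(branch_features, scores):
    """Weighted element-wise sum of three branch feature maps (one channel).

    branch_features: three equally long lists of voxel values (flattened).
    scores: the (q_c, k_c, v_c) values before Softmax.
    """
    weights = branch_weights(*scores)
    n_voxels = len(branch_features[0])
    return [
        sum(w * feat[i] for w, feat in zip(weights, branch_features))
        for i in range(n_voxels)
    ]


# Equal scores give each kernel size the same influence.
print(branch_weights(0.0, 0.0, 0.0))
print(fuse_channel([[3.0, 3.0], [6.0, 6.0], [9.0, 9.0]], (0.0, 0.0, 0.0)))
```

Because the three weights of a channel sum to one, the network can shift emphasis between kernel sizes per channel without changing the overall feature scale.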

Multi-mode mixed pooling-based residual learning block
The pooling layer is an essential functional module in the standard CNN structure, and it is employed for reducing the scale of feature maps and refining their information. Average pooling and max pooling are two popular choices in existing CNN-based nodule detection approaches, but each of them focuses only on a certain information component (average component or maximum component) in feature maps. This inevitably causes information loss during the pooling process and is detrimental to the feature learning of lung nodule CT images with variable lesion sizes and image intensities (e.g., the intensities of solid nodules are significantly higher than those of ground glass nodules [10]).
Considering the above fact and inspired by the recent advance in residual learning [63], four multi-mode mixed pooling (M³P)-based residual learning modules are designed and placed after the multi-kernel joint learning block for progressive deep feature extraction. As shown in Fig. 5, each M³P-based residual learning module is composed of three 3D residual learning units and one M³P unit. Moreover, a residual learning unit includes two identical convolutional layers with kernel size $k_r \times k_r \times k_r$, two batch normalization (BN) layers, two ReLU operations, and one residual connection. The learning procedure of the residual learning unit is defined as
$$X^{l+1}_r = \varphi\left(F_{res}\left(X^l_r, W_{res}\right) \oplus X^l_r\right)$$
where $\oplus$ denotes element-wise summation, $X^l_r \in \Re^{C \times H \times W \times S}$ and $X^{l+1}_r \in \Re^{C \times H \times W \times S}$ are the 4D input and output tensors of a residual learning unit, $\varphi$ is the activation function ReLU, and $F_{res}\left(X^l_r, W_{res}\right)$ represents a learning function, which is calculated as
$$F_{res}\left(X^l_r, W_{res}\right) = W^2_{res}\,\varphi\left(W^1_{res}\,X^l_r\right)$$
in which $W^1_{res}$ and $W^2_{res}$ denote the weights of two continuous 3D convolutional operations. In the MK-3DCNN framework, the traditional single-mode pooling computation is extended to the designed M³P operation, which allows the extraction of more comprehensive sensitive information of lung nodules. As exhibited in Fig. 6, the M³P model integrates three different pooling operations that complement each other: max pooling, average pooling, and center cropping pooling [64].

Fig. 6. Illustration of the developed M³P module. The input $X^l_p$ of M³P is the convolutional feature gained from the residual learning unit, $X_{cenp}$ is the center area cropped from $X^l_p$, and $X_{maxp}$ and $X_{avep}$ are the max pooled and average pooled feature maps, respectively. The above three pooled features are concatenated to form the mixed map $X_{mixp}$, and $X_{mixp}$ is further processed through a channel-wise squeeze operation to obtain the final output $X^{l+1}_p$ of this module.
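Given a feature tensor $X^l_p \in \Re^{C \times H \times W \times S}$ generated from the previous convolutional layer, the output feature map $X^{l+1}_p \in \Re^{C \times H' \times W' \times S'}$ of the developed M³P model can be expressed as
$$X^{l+1}_p = \psi\left(\left[F_{maxp}\left(X^l_p\right), F_{avep}\left(X^l_p\right), F_{cenp}\left(X^l_p\right)\right]\right)$$
where $[\cdot]$ denotes the concatenation operation, $\psi$ represents a channel-wise squeezing operation, and $F_{maxp}\left(X^l_p\right)$, $F_{avep}\left(X^l_p\right)$ and $F_{cenp}\left(X^l_p\right)$ denote max pooling, average pooling, and center cropping pooling, respectively. In our proposed MK-3DCNN method, each of $H'$, $W'$ and $S'$ is half of the corresponding $H$, $W$ and $S$, and a convolutional layer with kernel scale 1 × 1 × 1 is exploited to achieve the channel-wise squeezing.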
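A one-dimensional analogue may help to show how the three pooling modes are combined. The sketch below assumes a pooling window and stride of 2, a center crop of half the input length, and a 1 × 1 squeeze expressed as a plain weight matrix; all function names are hypothetical, and the real module operates on 3D feature maps.

```python
def max_pool(x):
    """Max pooling with window 2 and stride 2 (len(x) must be even)."""
    return [max(x[i], x[i + 1]) for i in range(0, len(x), 2)]


def avg_pool(x):
    """Average pooling with window 2 and stride 2 (len(x) must be even)."""
    return [(x[i] + x[i + 1]) / 2 for i in range(0, len(x), 2)]


def center_crop(x):
    """Keep the central half of the signal, so the length is also halved."""
    half = len(x) // 2
    start = (len(x) - half) // 2
    return x[start:start + half]


def m3p(channels, squeeze_weights):
    """Concatenate the three pooled versions channel-wise, then squeeze.

    channels: list of C signals; squeeze_weights: C_out rows of 3*C weights,
    standing in for a convolution with kernel size 1.
    """
    mixed = []
    for pool in (max_pool, avg_pool, center_crop):
        mixed.extend(pool(c) for c in channels)
    length = len(mixed[0])
    return [
        [sum(w * m[i] for w, m in zip(row, mixed)) for i in range(length)]
        for row in squeeze_weights
    ]


# One input channel of length 4, squeezed back to one channel of length 2.
print(m3p([[1, 4, 2, 8]], [[1, 1, 1]]))
```

The squeeze brings the channel count back from 3C to the desired width, so the block can replace a single-mode pooling layer without changing the layers around it.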

Model output
As exhibited in Fig. 2, in the decoding section of the MK-3DCNN framework, two deconvolutional layers and two residual learning modules are exploited to continuously decode the features extracted from the encoding part. Here, the residual module is made up of the three residual units of the designed M³P-based residual learning model (i.e., without the pooling part), and two concatenation operations are added to learn the multi-level features of lung nodule CT images. In addition, following the work in [53], the location information $L_{inf}$ of the proposal is introduced in the MK-3DCNN to get better nodule detection performance.
After decoding the learned features, two convolutional layers with the kernel scale 1 × 1 × 1 are used to map the obtained feature tensor to the result output with the dimension $C_{d4} \times (M/4) \times (M/4) \times (M/4)$. Further, the output 4D tensor is resized to $5 \times N_{anc} \times (M/4) \times (M/4) \times (M/4)$ (i.e., $C_{d4} = N_{anc} \times 5$), in which $N_{anc}$ is the number of anchors. Inspired by the RPN, to cope with the variable lung nodule sizes, three different anchors are designed at every location, with side lengths of 5, 10, and 20, respectively (i.e., the value of $N_{anc}$ is 3). Moreover, the 5 regression values are $\left(g, \tilde{v}_x, \tilde{v}_y, \tilde{v}_z, \tilde{v}_r\right)$, the activation function Sigmoid is applied to $g$, and this procedure is defined as
$$\tilde{p}_{cla} = \frac{1}{1 + \exp(-g)} \tag{8}$$
In addition, a non-maximum suppression (NMS) operation [61] is introduced for reducing redundancy and optimizing detection results.
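How the $C_{d4} = N_{anc} \times 5$ output values at one spatial location map to per-anchor predictions can be sketched as below. The anchor-major ordering of the flat vector and the dictionary layout are assumptions for illustration; the actual memory layout of the network output is not specified in the text.

```python
import math


def sigmoid(g):
    """Eq. (8): map the raw score g to a probability."""
    return 1.0 / (1.0 + math.exp(-g))


def split_anchor_outputs(vector, n_anc=3):
    """Split a flat vector of n_anc * 5 values into per-anchor predictions.

    Assumes anchor-major order: (g, v_x, v_y, v_z, v_r) for anchor 0,
    then the same five values for anchor 1, and so on.
    """
    if len(vector) != n_anc * 5:
        raise ValueError("expected n_anc * 5 values")
    predictions = []
    for a in range(n_anc):
        g, vx, vy, vz, vr = vector[a * 5:(a + 1) * 5]
        predictions.append({"p_cla": sigmoid(g), "offsets": (vx, vy, vz, vr)})
    return predictions


# With N_anc = 3, every location carries 15 output values.
for pred in split_anchor_outputs([0.0, 0.1, -0.2, 0.0, 0.3] * 3):
    print(pred)
```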

Loss function
The proposed MK-3DCNN framework is a one-stage lung nodule detection model, and it can concurrently predict nodule probabilities and nodule locations. The MK-3DCNN network is trained end-to-end with a multi-task loss that contains a classification loss and a regression loss, and the overall training loss $L_{dec}$ is defined as
$$L_{dec} = \lambda_{add} L_{cla} + p_{gt} L_{reg} \tag{9}$$
in which $\lambda_{add}$ denotes the summation weight, $p_{gt} \in \{0, 1\}$ means the label of an anchor box (1 for positive samples and 0 for negative samples), and $L_{cla}$ and $L_{reg}$ are the classification loss and regression loss, respectively. In the MK-3DCNN architecture, the cross entropy loss [64] is adopted as $L_{cla}$, and it is defined by
$$L_{cla} = -\left[p_{gt} \log\left(\tilde{p}_{cla}\right) + \left(1 - p_{gt}\right) \log\left(1 - \tilde{p}_{cla}\right)\right]$$
in which $\tilde{p}_{cla}$ means the predicted probability. We denote the bounding box of an anchor as $\left(a_x, a_y, a_z, a_r\right)$ and represent the ground truth bounding box of a lung nodule object in the same four-element form, in which the first three elements represent the center point coordinates and the last one means the size of the bounding box. The intersection over union (IoU) [61] is exploited for determining the anchor label. Specifically, if the IoU between an anchor and the ground truth bounding box is larger than $V_{ps}$, the anchor is labeled as a positive sample (i.e., $p_{gt} = 1$), whereas if the IoU is smaller than $V_{ns}$, the corresponding anchor serves as a negative sample (i.e., $p_{gt} = 0$). The other cases are not considered in the training procedure.
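Moreover, regression labels are computed for the bounding boxes, and the predicted values corresponding to these regression labels are $\tilde{v}_x$, $\tilde{v}_y$, $\tilde{v}_z$ and $\tilde{v}_r$, respectively. Further, the regression loss $L_{reg}$ applies the smooth L1 loss function [65], denoted $L_{smoothl1}$, to the differences between the regression labels and the predicted values, and $L_{smoothl1}$ is defined by
$$L_{smoothl1}(x) = \begin{cases} 0.5x^2, & |x| < 1 \\ |x| - 0.5, & \text{otherwise} \end{cases}$$
In our experiments, $V_{ps}$ and $V_{ns}$ are set to 0.5 and 0.02, respectively.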
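The anchor labeling and loss computation described in this section can be put together in a minimal sketch. Several parts are assumptions rather than statements of the text: boxes are treated as axis-aligned cubes given by (center x, center y, center z, side length), the regression labels follow the common RPN-style encoding (center offsets divided by the anchor size, logarithm of the size ratio), and the summation weight is placed on the classification term. The function names are hypothetical.

```python
import math


def cube_iou(a, b):
    """IoU of two axis-aligned cubes given as (x, y, z, side)."""
    inter = 1.0
    for i in range(3):
        low = max(a[i] - a[3] / 2, b[i] - b[3] / 2)
        high = min(a[i] + a[3] / 2, b[i] + b[3] / 2)
        if high <= low:
            return 0.0
        inter *= high - low
    return inter / (a[3] ** 3 + b[3] ** 3 - inter)


def anchor_label(iou, v_ps=0.5, v_ns=0.02):
    """1 for positive, 0 for negative, None for anchors ignored in training."""
    if iou > v_ps:
        return 1
    if iou < v_ns:
        return 0
    return None


def regression_labels(anchor, truth):
    """Assumed RPN-style encoding of a ground truth box relative to an anchor."""
    ax, ay, az, ar = anchor
    tx, ty, tz, tr = truth
    return ((tx - ax) / ar, (ty - ay) / ar, (tz - az) / ar, math.log(tr / ar))


def smooth_l1(x):
    return 0.5 * x * x if abs(x) < 1 else abs(x) - 0.5


def detection_loss(p_gt, p_cla, labels, predictions, lambda_add=0.3):
    """Assumed form: lambda_add * L_cla + p_gt * L_reg."""
    l_cla = -(p_gt * math.log(p_cla) + (1 - p_gt) * math.log(1 - p_cla))
    l_reg = sum(smooth_l1(v - p) for v, p in zip(labels, predictions))
    return lambda_add * l_cla + p_gt * l_reg


anchor, truth = (0, 0, 0, 10), (1, 0, -1, 10)
label = anchor_label(cube_iou(anchor, truth))
print(label, regression_labels(anchor, truth))
```

Multiplying the regression term by $p_{gt}$ means that only positive anchors contribute to the box regression, while every labeled anchor contributes to the classification term.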

Experimental design
In this work, the benchmark dataset LUNA16 and the clinical dataset CQUCH-LND are employed to evaluate the nodule detection performance of the presented MK-3DCNN, and several state-of-the-art (SOTA) approaches are utilized for comparison. To obtain reliable nodule detection results, we conduct 10-fold cross-validation experiments according to the dataset split provided in the LUNA16 (i.e., in one iteration, one fold of the dataset is used for testing and the others for training, and this operation is iterated until each fold has been tested). In the proposed MK-3DCNN framework, the convolution kernel size $k_r \times k_r \times k_r$ in the residual learning unit is set to 3 × 3 × 3, the sizes of the three convolution kernels in the MKJLM module are set to 3 × 3 × 3, 5 × 5 × 5 and 7 × 7 × 7, respectively, and the stride of all the above convolution operations is set to 1. The reduction ratio $r$ is set to 2. Moreover, the spatial scale $M$ (height = width = number of slices) is set to 96, the numbers of channels $C_{e1}$ to $C_{e5}$ are set to 24, 32, 64, 64 and 64, and $C_{d1}$ to $C_{d4}$ are set to 64, 64, 128 and 15, respectively. In our experiments, the batch size is set to 16 based on the GPU memory, and the stochastic gradient descent optimizer is selected for optimization. Because the number of samples is small, a dynamic learning rate mechanism with an initial value of 0.01 is used for training, and the learning rate is reduced by a factor of ten every 50 epochs. We train for 200 epochs in total to optimize the deep model to convergence. All experiments are conducted on a server with the following major configuration: 6 NVIDIA RTX TITAN GPUs, 2 Intel Xeon 3.6 GHz CPUs, 256 GB memory and the Ubuntu 18.04.1 system. Moreover, the Python-based PyTorch library is used for implementing the developed MK-3DCNN model.
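The learning rate schedule can be written down in a few lines. This sketch assumes that "decays ten times every 50 epochs" means a step decay that divides the rate by ten at epochs 50, 100 and 150; the function name is hypothetical.

```python
def learning_rate(epoch, base_lr=0.01, step=50, factor=0.1):
    """Step decay: multiply the rate by `factor` once every `step` epochs."""
    return base_lr * factor ** (epoch // step)


# Rate at the first epoch of each 50-epoch stage of the 200-epoch run.
for epoch in (0, 50, 100, 150):
    print(epoch, learning_rate(epoch))
```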

Evaluation metrics
Following previous studies [13], the free-response receiver operating characteristic (FROC) analysis is employed for evaluating the lung nodule detection performance of the proposed MK-3DCNN method. In the FROC curve, the recall is plotted as a function of the average number of false positives per scan (FPs/scan). The recall denotes the ratio of the number of detected true positives to the number of all nodules, and it is calculated as
$$Recall = \frac{TP_{det}}{TP_{det} + FN_{mis}} \tag{14}$$
where $TP_{det}$ and $FN_{mis}$ represent the number of detected and undetected nodules, respectively. Clearly, the sum of $TP_{det}$ and $FN_{mis}$ is the total number of nodule samples.
Furthermore, the competition performance metric (CPM) [13,66] is introduced to extract one overall score from the FROC curve. The CPM is defined as the mean of the recall at seven fixed false positive rates, and it can be computed by
$$CPM = \frac{1}{N_{fps}} \sum_{i_{fps}} Recall\left(i_{fps}\right)$$
in which $Recall\left(i_{fps}\right)$ is the recall when the average number of false positives per scan is set to $i_{fps} \in \{0.125, 0.25, 0.5, 1, 2, 4, 8\}$, and the value of $N_{fps}$ is 7. Obviously, the lowest possible CPM value is 0, and a perfect detection model will have a CPM of 1.
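The two metrics can be computed as in the following minimal sketch. The recall values in the usage example are made-up numbers chosen only to show the arithmetic; they are not results from the paper.

```python
FP_RATES = (0.125, 0.25, 0.5, 1, 2, 4, 8)  # false positives per scan


def recall(tp_det, fn_mis):
    """Eq. (14): detected nodules divided by all nodules."""
    return tp_det / (tp_det + fn_mis)


def cpm(recall_at_fp):
    """Mean recall over the seven fixed false positive rates.

    recall_at_fp: mapping from each rate in FP_RATES to the recall there.
    """
    return sum(recall_at_fp[fp] for fp in FP_RATES) / len(FP_RATES)


# Hypothetical FROC operating points, for illustration only.
example = dict(zip(FP_RATES, (0.70, 0.75, 0.80, 0.85, 0.90, 0.95, 1.00)))
print(recall(90, 10), cpm(example))
```

Averaging over rates that double from 0.125 to 8 gives equal weight to the strict low-false-positive region and the permissive high-recall region of the FROC curve.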

Parameter sensitivity analysis
The summation weight $\lambda_{add}$ between the classification loss $L_{cla}$ and the regression loss $L_{reg}$ is an important hyper-parameter in our proposed MK-3DCNN method, and it affects the detection results. In this part, parameter sensitivity experiments are performed to analyze the influence of the summation weight and select the most appropriate value. To evaluate the detection performance for different weighting coefficients, the parameter $\lambda_{add}$ is tuned over the set {0.1, 0.2, 0.3, . . ., 1.0}. In this experiment, we randomly choose 90% of the CT scans in the LUNA16 as the training set and the other 10% as the test set. The histogram provided in Fig. 7 illustrates the variation of the CPM with the parameter $\lambda_{add}$. As can be seen from Fig. 7, the CPM first improves and then reaches a peak value as the weight $\lambda_{add}$ increases. When $\lambda_{add}$ exceeds a certain value, the detection result begins to decline slightly. The reason is that the introduction of the classification loss enables the deep model to learn the representation information of nodules more effectively and boosts the detection ability. However, a large $\lambda_{add}$ value weakens the influence of the position regression operation and leads to a decrease in the detection performance. To balance the importance of the classification and position detection operations, the parameter $\lambda_{add}$ is set to 0.3 in the following experiments.

Ablation study
In the proposed MK-3DCNN framework, the MKJLM component and M 3 P component play key roles in achieving accurate nodule detection.To validate the contributions of different functional modules, a set of ablation studies is performed by constructing three comparison models, including the BaseNet, BaseNet-MK, and BaseNet-M 3 P. BaseNet is the baseline model (i.e. the MKJLM and M 3 P modules in the MK-3DCNN are replaced by the standard convolutional layer and the maximum pooling operation, respectively).BaseNet-MK model is designed via embedding the MKJLM into the BaseNet (i.e. the M 3 P module is removed in the MK-3DCNN).Similarly, the BaseNet-M 3 P model is a combination of the BaseNet and M 3 P (i.e. the MKJLM module is removed in the MK-3DCNN).Moreover, considering that our proposed MKJLM module includes the squeeze-and-excitation technique, we also designed two contrast models BaseNet-SE and BaseNet-SE-M 3 P.The BaseNet-SE model is constructed by placing a squeeze-and-excitation module [67] behind each of the two standard convolution layers of the baseline model BaseNet.The BaseNet-SE-M 3 P model is a combination of the BaseNet-SE and the M 3 P.
In addition, the standard deviation (SD) of the CPM and the statistical significance (P-value) of the FROC (versus the proposed MK-3DCNN) are added as evaluation metrics to fully evaluate the nodule detection performance of the MK-3DCNN. The Wilcoxon signed-rank test [68] is used for statistical significance testing. To ensure sufficient samples and effective P-value calculations, 6 additional sample points are extracted (one point between each pair of adjacent values among the 7 fixed false positive rates), resulting in a total of 13 observation points.
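The significance test can be illustrated with a small self-contained sketch. Two points are assumptions: the 7 fixed false positive rates are taken to be the usual LUNA16 values (1/8, 1/4, 1/2, 1, 2, 4, 8 per scan), and each extra point is taken to be the arithmetic midpoint of two adjacent rates, since the text does not say how the intermediate points are placed. The test itself is an exact two-sided Wilcoxon signed-rank test on paired sensitivities; in practice a library routine such as `scipy.stats.wilcoxon` would normally be used instead.

```python
from typing import List, Sequence, Tuple


def observation_points(fixed: Sequence[float]) -> List[float]:
    """Insert one point (assumed: the midpoint) between each pair of adjacent rates."""
    points: List[float] = []
    for a, b in zip(fixed, fixed[1:]):
        points += [a, (a + b) / 2.0]
    points.append(fixed[-1])
    return points


def wilcoxon_signed_rank(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Exact two-sided Wilcoxon signed-rank test for paired samples.

    Zero differences are dropped and tied magnitudes share their average rank.
    Returns (W_plus, p_value).
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks2 = [0] * n  # doubled ranks, so that average ranks stay integers
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        for k in range(i, j + 1):
            ranks2[order[k]] = (i + 1) + (j + 1)  # twice the mean of ranks i+1..j+1
        i = j + 1
    w_plus2 = sum(r for r, d in zip(ranks2, diffs) if d > 0)
    total2 = sum(ranks2)
    # Null distribution of W+: every rank is positive or negative with probability 1/2.
    counts = {0: 1}
    for r in ranks2:
        nxt = {}
        for s, c in counts.items():
            nxt[s] = nxt.get(s, 0) + c
            nxt[s + r] = nxt.get(s + r, 0) + c
        counts = nxt
    w_min2 = min(w_plus2, total2 - w_plus2)
    tail = sum(c for s, c in counts.items() if s <= w_min2)
    return w_plus2 / 2.0, min(1.0, 2.0 * tail / 2 ** n)


FIXED_FP_RATES = [0.125, 0.25, 0.5, 1.0, 2.0, 4.0, 8.0]  # assumed LUNA16 operating points
POINTS = observation_points(FIXED_FP_RATES)  # 13 observation points
```

With 13 pairs, the smallest attainable two-sided P-value is 2/2^13, which is reached when one model has the higher sensitivity at every observation point.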
Table 2 reports the nodule detection results of the MK-3DCNN and the above component-based comparison models. As shown in Table 2, the developed MKJLM and M³P modules improve the nodule detection capability of the deep model. Specifically, the detection performance gains obtained from the multi-kernel joint learning (BaseNet-MK vs BaseNet) and the multi-mode mixed pooling learning (BaseNet-M³P vs BaseNet) are 1.2% and 0.92% in terms of CPM, respectively. Furthermore, the CPM of the BaseNet-SE-M³P is higher than that of the BaseNet-SE, which also indicates the effectiveness of the designed M³P module. The proposed MK-3DCNN method achieves the highest CPM with a low SD value, and its lung nodule detection performance is better than that of all the component-based contrast approaches at the P < 0.05 level. These experimental results indicate that the developed MKJLM module can effectively cope with the variable nodule scale by cooperatively learning the 3D multi-scale spatial information of nodule CT images. By designing a multi-mode mixed pooling architecture, the M³P module incorporates the max pooling, average pooling, and center cropping pooling operations to simultaneously learn high-intensity, low-intensity, and scale information. It can thus strengthen the nodule detection ability in complex lung environments compared to the traditional single-mode pooling-based learning manner. The FROC curves in Fig. 8 visually display the superior nodule detection capability of the MK-3DCNN compared with the component-based models.
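For reference, the CPM used throughout these comparisons is the mean sensitivity over the seven fixed false positive rates of the FROC curve. The sketch below applies this to the seven sensitivities found in the table-row fragment reproduced at the end of this document, taking the first entry as 0.7099 (an assumption, since its leading digits appear truncated in the extracted text); their mean matches the value 0.8591 that follows them in the same row. The fragment does not state which model the row belongs to.

```python
from typing import Sequence

# Assumed LUNA16 operating points (false positives per scan).
FP_RATES = (0.125, 0.25, 0.5, 1, 2, 4, 8)


def cpm(sensitivities: Sequence[float]) -> float:
    """Competition performance metric: mean sensitivity at the fixed FP rates."""
    if len(sensitivities) != len(FP_RATES):
        raise ValueError("expected one sensitivity per fixed false positive rate")
    return sum(sensitivities) / len(sensitivities)


# First entry assumed to be 0.7099 (leading digits truncated in the extracted text).
row = [0.7099, 0.7723, 0.8356, 0.8836, 0.9174, 0.9384, 0.9562]
print(round(cpm(row), 4))  # 0.8591
```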

Validity analysis of data augmentation operation
Data augmentation is a frequently used method for resolving the overfitting issue of deep learning models in nodule CT image analysis with limited training samples. In this work, an augmentation approach based on flipping, rotating, and resizing operations is introduced to increase sample quantity. To assess the effectiveness of the designed data augmentation strategy in the nodule detection task, a controlled experiment is organized by devising a comparison model, MK-3DCNN without DA. The MK-3DCNN without DA has the same network structure as the developed MK-3DCNN model, but data augmentation processing is not used during the model training. Figure 9 shows the nodule detection performance of the MK-3DCNN model under the two conditions. As illustrated in Fig. 9, compared with the MK-3DCNN without DA, the proposed MK-3DCNN method achieves higher recalls at six of the seven set false positive rates, and the CPM gain reaches 0.79%. These experimental results indicate that the designed data augmentation operation can effectively improve the nodule detection performance of the deep model by expanding the diversity of samples. The FROC curves in Fig. 10 further illustrate the validity of the used data augmentation strategy.
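The three augmentation operations can be sketched on a small 3D patch as follows. This is only an illustration under assumptions: the text does not specify the flip axes, rotation angles, or resizing scheme, so the sketch uses a left-right flip, a 90-degree in-plane rotation, and nearest-neighbour rescaling, with the patch stored as nested lists indexed [slice][row][column].

```python
from typing import List

Volume = List[List[List[float]]]  # indexed [slice][row][column]


def flip_columns(vol: Volume) -> Volume:
    """Mirror every slice left-to-right."""
    return [[row[::-1] for row in sl] for sl in vol]


def rot90_in_plane(vol: Volume) -> Volume:
    """Rotate every slice by 90 degrees counter-clockwise in the row/column plane."""
    return [[list(r) for r in zip(*sl)][::-1] for sl in vol]


def resize_nearest(vol: Volume, factor: float) -> Volume:
    """Rescale all three axes by `factor` using nearest-neighbour sampling."""
    def source_indices(n: int) -> List[int]:
        m = max(1, round(n * factor))
        return [min(n - 1, int(i / factor)) for i in range(m)]

    zs = source_indices(len(vol))
    ys = source_indices(len(vol[0]))
    xs = source_indices(len(vol[0][0]))
    return [[[vol[z][y][x] for x in xs] for y in ys] for z in zs]


# Note: nodule centre coordinates and diameters must be transformed with the
# same operation so that the labels stay aligned with the augmented patch.
patch = [[[1, 2], [3, 4]]]
print(flip_columns(patch))    # [[[2, 1], [4, 3]]]
print(rot90_in_plane(patch))  # [[[2, 4], [1, 3]]]
```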

Comparison with some state-of-the-art methods
To comprehensively assess the lung nodule detection ability of the proposed MK-3DCNN, several SOTA approaches are chosen for comparison on the LUNA16. Following previous research [48,52], the nodule detection results reported in the corresponding articles are used for comparison, and the details are listed in Table 3. All the results listed in Table 3 are produced by one-stage detection models to make a fair comparison. In Table 3, the 3D-RES [54] and the LNOR-Net [53] construct an encoder-decoder structure as the backbone of the nodule detection model to learn multi-layer information. In the 3D-DPN [54], a dual path block is designed to simultaneously explore the advantages of residual learning and dense connection. The YOLOv3-Net [14] is a YOLO architecture-based method. The SGDA [59] is an attention mechanism-based model. In addition, a multi-scale feature learning-based approach, MSANet [48], is also chosen for comparison, in which a multi-scale attention block is constructed to boost nodule detection sensitivity. According to Table 3, owing to the utilization of multi-scale features of nodule CT images, the MSANet method achieves good detection results. Our developed MK-3DCNN framework obtains the highest CPM value among the aforementioned SOTA methods. Owing to variable lesion sizes and complicated anatomic structures, extracting 3D discriminative features is critical for achieving accurate nodule position detection. In the proposed MK-3DCNN method, a multi-kernel joint learning module is constructed to fully learn the 3D multi-scale spatial information of nodule CT images. By building a multi-mode mixed pooling-based residual learning block for feature extraction, the MK-3DCNN model can effectively alleviate the issue of information loss in traditional single-mode pooling-based detection models. As a result, more discriminative nodule representations can be obtained.

Visual analysis of detection results
To further analyze the nodule detection capability of the proposed MK-3DCNN model, some representative examples of the detection results generated by the MK-3DCNN model and the baseline model BaseNet (i.e., without the MKJLM and M³P modules) are illustrated in Fig. 11. Since a thoracic CT scan is volumetric imaging data, only the central slice where a detected nodule is located is plotted for visualization. In Fig. 11, the red rectangular boxes in the first row of images mark the position ground truths of the nodule samples, and the blue rectangular boxes in the second and third rows of images mark the detection results produced by the BaseNet and MK-3DCNN models, respectively. The central slice number is provided below each image. The second row below each image of the detection result part shows the nodule detection probabilities. The side length of the rectangular box corresponds to the nodule scale.
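The plotting convention described above can be expressed as a short sketch. It assumes that each detection is given as a centre (x, y, z) in voxel coordinates, a diameter d, and a probability p; this tuple layout and the names `Detection` and `central_slice_box` are illustrative and not taken from the paper.

```python
from typing import NamedTuple, Tuple


class Detection(NamedTuple):
    x: float  # column coordinate of the centre (voxels)
    y: float  # row coordinate of the centre (voxels)
    z: float  # slice coordinate of the centre (voxels)
    d: float  # nodule diameter (voxels)
    p: float  # detection probability


def central_slice_box(det: Detection) -> Tuple[int, Tuple[float, float, float, float]]:
    """Return the central slice index and the square box (x0, y0, x1, y1) drawn on it.

    The side length of the box equals the nodule diameter.
    """
    half = det.d / 2.0
    box = (det.x - half, det.y - half, det.x + half, det.y + half)
    return int(round(det.z)), box


slice_index, box = central_slice_box(Detection(x=50.0, y=60.0, z=31.6, d=8.0, p=0.9))
print(slice_index, box)  # 32 (46.0, 56.0, 54.0, 64.0)
```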
As shown in Fig. 11, both our proposed MK-3DCNN method and the contrasted method BaseNet achieve good detection results for nodules with noticeable visual differences from the background, such as the #1 and #2 nodule samples. However, when nodule lesions share many visual similarities with surrounding tissues (e.g., the #3, #4 and #7 nodules are similar in size and appearance to neighboring tissues, and the #5, #6 and #8 nodules have similar intensity to the background), the baseline approach BaseNet cannot effectively locate the nodule objects, which results in low detection confidence values or even missed detections. By comparison, the proposed MK-3DCNN model is able to locate the nodule bodies more accurately and produces better detection performance. These results visually demonstrate that the MK-3DCNN can effectively cope with variable nodule size, appearance and intensity, and works well for detecting nodules in complex lung environments.

External validation on clinical dataset CQUCH-LND
In clinical practice, the gold standard in lung nodule diagnosis is biopsy-based cytological analysis, not just the radiological characteristics. In view of this, the CQUCH-LND dataset, annotated according to this diagnostic gold standard, is built and used to validate the generalization capability of our proposed MK-3DCNN method.
To fully assess the nodule detection performance of the deep model, two kinds of experiments are performed on the CQUCH-LND: (1) the MK-3DCNN model and three component-based comparison models that have been trained on the LUNA16 are directly tested on the CQUCH-LND, and (2) the trained MK-3DCNN and the three comparison models are fine-tuned using 2-fold cross-validation on this dataset. As with the experiments on the LUNA16, the FROC analysis is used to quantitatively evaluate the detection performance. Similar to Table 2, the standard deviation (SD) of the CPM and the statistical significance (P-value) of the FROC are added as evaluation metrics to fully evaluate the nodule detection performance of our proposed MK-3DCNN method. In the direct testing and fine-tuning experiments, the P-value is calculated versus the MK-3DCNN and the fine-tuned MK-3DCNN, respectively.
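The 2-fold protocol of experiment (2) can be sketched as below. How the CQUCH-LND scans were actually partitioned is not described here, so the sketch assumes a random split at the scan level with a fixed seed; `two_fold_splits` is a hypothetical helper name.

```python
import random
from typing import Iterable, List, Tuple


def two_fold_splits(scan_ids: Iterable, seed: int = 0) -> List[Tuple[list, list]]:
    """Split scans into two halves and return (fine-tuning set, test set) pairs.

    Each half serves once for fine-tuning and once for testing.
    """
    ids = list(scan_ids)
    random.Random(seed).shuffle(ids)  # assumed: random scan-level assignment
    half = len(ids) // 2
    fold_a, fold_b = ids[:half], ids[half:]
    return [(fold_a, fold_b), (fold_b, fold_a)]


for tune_ids, test_ids in two_fold_splits(range(10)):
    # Hypothetical usage: fine-tune the LUNA16-trained model on `tune_ids`,
    # then run the FROC analysis on `test_ids`.
    assert not set(tune_ids) & set(test_ids)
```

Splitting by scan rather than by nodule keeps all nodules of one patient scan on the same side of the split.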
As can be observed from the detection results listed in Table 4, fine-tuning attains better detection performance than direct testing: the CPM of the MK-3DCNN, BaseNet-M³P, BaseNet-MK and BaseNet increases from 0.8523, 0.8322, 0.8419 and 0.8229 to 0.8642, 0.8479, 0.8561 and 0.8360, respectively. In both the fine-tuning and direct testing experiments, the lung nodule detection performance of our proposed MK-3DCNN approach is better than that of all contrast approaches at the P < 0.05 level, which indicates that the MK-3DCNN has a good prospect in real applications. The FROC curves in Fig. 12 display the detection performance of the developed MK-3DCNN and the three component-based contrasted models. Moreover, the SD value of the MK-3DCNN is not lower than that of all the comparison methods. This is because there are too few samples to cover all types of lung nodules, resulting in a certain fluctuation in the detection results. In future work, constructing large-scale clinical datasets to support the development of high-performance models is one of the promising directions. In addition, the visualization of several representative nodule detection results produced by fine-tuning the MK-3DCNN and the BaseNet on the CQUCH-LND dataset is shown in Fig. 13.
Similar to the detection results on the LUNA16 dataset, for easy samples (e.g., the #1 and #2 nodule samples), both the proposed MK-3DCNN approach and the comparison approach BaseNet obtain good detection performance. For some hard samples (e.g., the #3-#8 nodule samples), the MK-3DCNN is clearly superior to the BaseNet. These experimental results support the effectiveness of the MK-3DCNN in the nodule detection task.
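As a worked check of the CPM values quoted above for Table 4, the following lines compute the absolute CPM gain that fine-tuning brings to each model and the margin of the MK-3DCNN over the BaseNet in both settings. Only the numbers stated in the text are used.

```python
# CPM values on CQUCH-LND as quoted in the text (direct testing vs fine-tuning).
direct = {"MK-3DCNN": 0.8523, "BaseNet-M3P": 0.8322, "BaseNet-MK": 0.8419, "BaseNet": 0.8229}
fine_tuned = {"MK-3DCNN": 0.8642, "BaseNet-M3P": 0.8479, "BaseNet-MK": 0.8561, "BaseNet": 0.8360}

# Absolute CPM gain from fine-tuning, per model.
gains = {name: round(fine_tuned[name] - direct[name], 4) for name in direct}
print(gains)
# {'MK-3DCNN': 0.0119, 'BaseNet-M3P': 0.0157, 'BaseNet-MK': 0.0142, 'BaseNet': 0.0131}

# Margin of the full model over the baseline in each setting.
margin_direct = round(direct["MK-3DCNN"] - direct["BaseNet"], 4)
margin_fine_tuned = round(fine_tuned["MK-3DCNN"] - fine_tuned["BaseNet"], 4)
print(margin_direct, margin_fine_tuned)  # 0.0294 0.0282
```

All four models gain between roughly 1.2 and 1.6 CPM points from fine-tuning, while the ranking of the models is the same in both settings.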

Limitation
Although the developed MK-3DCNN model has achieved promising results on both the public dataset LUNA16 and the clinical dataset CQUCH-LND, there are still some limitations that need to be further considered. On the one hand, a 3D deep learning network with a large number of parameters must be fully trained, so the MK-3DCNN has a high computational overhead. In the future, we will focus on designing efficient optimization algorithms to accelerate the training of the MK-3DCNN. On the other hand, as shown in Fig. 14, several ground glass nodules have small scales and low densities, which prevents the model from detecting them accurately. Likewise, some background tissues share many visual similarities with nodules in terms of shape and size, so the MK-3DCNN falsely detects them as nodules. One possible solution is to combine the theories of graph learning and manifold learning to achieve more reasonable characterizations of nodule CT images.

False positive reduction experiment analysis
As with some previous related work [53,54,59], our study focuses on developing a one-stage end-to-end 3D model for automated detection of lung nodules in chest CT scans. To further evaluate the performance of the proposed MK-3DCNN method, and considering the limited nodule samples, we introduce a 3D self-supervised transfer learning method [69] to conduct an additional false positive reduction (FPR) experiment on the benchmark dataset LUNA16.
In the FPR process, a 3D encoder-decoder structure with residual connections [69] is used to implement self-supervised pre-training and learn valuable representation information from large amounts of randomly cropped unlabeled data, which helps reduce the dependence on labeled samples. Then, the pre-trained encoder part is transferred as the feature extractor, and the global average pooling operation is used to convert the feature map generated by the last convolutional layer of the encoder into a 512-dimensional feature vector. Finally, a classifier consisting of two fully connected layers (with 512 and 256 neurons, respectively) and a Sigmoid unit is constructed to achieve the FPR. In this experiment, five image perturbation strategies (nonlinear transformation, local pixel shuffling, local pixel swapping, inner pixel cutout, and outer pixel cutout [69,70]) are integrated to enhance the image representation ability of the self-supervised learning network. Furthermore, conventional image rotation and image flipping approaches are used for data augmentation. The mean square error loss function and the stochastic gradient descent (SGD) optimizer with an initial learning rate of 1.0 are selected for self-supervised pre-training, and the cross-entropy loss function and the adaptive moment estimation (Adam) optimizer with an initial learning rate of 0.001 are adopted for the FPR training. The learning rate is halved when the model performance has not improved over 10 epochs, the input size is set to 64 × 64 × 32, the batch size is set to 32, and early stopping is employed to obtain a better model.
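The learning-rate rule stated above (halve the rate when performance has not improved over 10 epochs) can be sketched as a small framework-independent helper. The class name `PlateauHalving`, the choice of a "higher is better" validation score as the monitored quantity, and the reset of the counter after each halving are assumptions made for illustration; the early-stopping criterion is not specified in the text and is therefore left out.

```python
class PlateauHalving:
    """Halve the learning rate after `patience` consecutive epochs without improvement."""

    def __init__(self, lr: float, patience: int = 10, factor: float = 0.5):
        self.lr = lr
        self.patience = patience
        self.factor = factor
        self.best = None   # best score seen so far
        self.stale = 0     # epochs since the last improvement

    def step(self, score: float) -> float:
        """Record one epoch's validation score (assumed: higher is better).

        Returns the learning rate to use for the next epoch.
        """
        if self.best is None or score > self.best:
            self.best = score
            self.stale = 0
        else:
            self.stale += 1
            if self.stale >= self.patience:
                self.lr *= self.factor
                self.stale = 0  # assumed: start counting again after a reduction
        return self.lr


# Initial rate of the Adam-based FPR training stage, as stated in the text.
scheduler = PlateauHalving(lr=0.001)
```

Calling `scheduler.step(score)` once per epoch keeps the rate at 0.001 while the score improves and halves it after ten stagnant epochs in a row.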
The nodule detection performance of our proposed MK-3DCNN with the FPR procedure (MK-3DCNN-FPR) and some existing models with the FPR step are contrasted in Table 5. In Table 5, the CNN-OSFHRL [52] attempts to solve the easy/hard sample imbalance issue by designing an online sample filtering algorithm. The DeepMed [71] develops a lightweight network to overcome the small sample problem. The NSADC-CNN [72] and the AA-3DCNN [9] are committed to solving the challenge of variable nodule sizes. For the A-CNN [73] and the DS-CMSF [3], a multi-stage detection framework is employed to achieve more accurate nodule detection. In the V-Net-SVM [74], a hard mining scheme is designed to improve the FPR performance. The SANet [55] tries to enhance the detection capability of the deep learning model by exploring long-range dependencies among one slice group and channels of the feature map. Compared to the above existing methods with the FPR process, the MK-3DCNN-FPR method obtains better lung nodule detection performance. Besides, from Table 3 and Table 5, it can be found that our proposed method achieves competitive performance compared with some SOTA one-stage, two-stage, and multi-stage methods, which indicates the effectiveness of our method in the lung nodule detection task.

Advantage and disadvantage analysis
Compared with the existing lung nodule detection studies, our work has the following advantages.
(1) We propose a deep learning-based one-stage 3D lung nodule detection model that does not require human intervention and achieves automated nodule detection in an end-to-end trainable way. Unlike traditional radiomics algorithms, our model is based on a neural network and does not demand hand-crafted feature design. Different from frequently used 2D models that need to combine 2D proposals into 3D proposals, our method directly generates 3D detection results.
(2) The proposed method is tested and verified in multiple aspects: the parameter sensitivity experiment is used to select the optimal parameter; the ablation study is conducted to demonstrate the effectiveness of the developed key modules; the visualization experiment is designed to intuitively analyze the detection performance of the proposed method; and the external validation experiment is performed to validate the generalization ability of our model.
(3) We construct a new lung nodule CT dataset. The gold standard for lung nodule diagnosis is biopsy-based cytological analysis, not just the radiological characteristics. Therefore, a clinical dataset based on this diagnostic gold standard is built to evaluate the detection performance of the proposed method in practical application.
There are also some disadvantages in our study, as follows.
(1) Applying a 3D CNN to lung nodule detection can capture rich spatial information of nodule CT images and achieve one-stage detection, but it occupies more memory than 2D networks, so the batch size and running speed are limited. Image patches are used instead of the entire CT scan as input to mitigate this issue.
(2) The benchmark dataset LUNA16 only contains 888 CT scans and 1186 lung nodule objects, which is too few samples to cover all types of lung nodules. The data augmentation strategy is designed to alleviate the model overfitting problem.
(3) Although our method can deal with the problem of variable size, shape, and density of lung nodules in chest CT images to a certain extent, there are still some gaps compared to several top-level methods on the LUNA16 Grand Challenge. In future work, we will introduce new foundation model-based techniques (e.g., generative pre-trained Transformer) to further improve the lung nodule detection performance.

Conclusion
In this paper, a multi-kernel driven 3D convolutional neural network (MK-3DCNN) is developed for the automatic detection of lung nodules in thoracic CT scans. The MK-3DCNN method adopts a residual learning-based encoder-decoder structure as the backbone to exploit the multi-layer features of the deep network. Different from traditional convolutional networks with a fixed kernel size, a multi-kernel joint learning block is designed to drive the detection model to capture 3D multi-scale spatial information from nodule CT images with variable lesion sizes and shapes. In addition, a multi-mode mixed pooling strategy is proposed to replace the conventional single-mode pooling. The designed pooling method incorporates three different types of pooling operations (max pooling, average pooling, and center cropping pooling), which complement each other to attain more comprehensive nodule CT image representations. To evaluate the validity of the presented MK-3DCNN, systematic experiments are performed on the public dataset LUNA16 and the clinical dataset CQUCH-LND. The experimental results indicate that the MK-3DCNN method outperforms some SOTA nodule detection approaches and generalizes well to clinical practice.

Fig. 1. Visualization of lung nodules on transverse, sagittal and coronal planes in thoracic CT scans. Lung nodules vary greatly in scale, shape, density, and lesion location. The lung nodule positions are marked by red arrows.

Fig. 2. The general framework of the presented MK-3DCNN method. M represents the spatial scale (height = width = number of slices), C_e1-C_e5 and C_d1-C_d4 denote the channel numbers, and L_inf means the embedded position information. The MK-3DCNN is trained end-to-end with a multi-task loss L_dec composed of the regression loss L_reg and the classification loss L_cla.

Fig. 5. The structure of the developed M³P-based residual learning module. The spatial sizes of the feature tensors X_r^l, X'_r, X''_r and X_r^{l+1} are kept at H × W × S by a padding strategy, and the convolutional kernels K'_r and K''_r both have size k_r × k_r × k_r. The shortcut between X_r^l

Fig. 7. The lung nodule detection performance with different values of the weight coefficient λ_add on the LUNA16.

Fig. 8. Comparisons of FROC curves generated by the proposed MK-3DCNN and the different component-based methods.

Fig. 9. The nodule detection performance of the proposed MK-3DCNN with and without the data augmentation operation.

Fig. 10. Comparisons of FROC curves generated by the proposed MK-3DCNN method with and without the data augmentation operation.

Fig. 11. Visualization of several representative lung nodule detection results yielded by the proposed MK-3DCNN and the comparison model BaseNet. The red and blue rectangular boxes show the position ground truths and detection results in the central slices, respectively. The central slice number is provided below each image. The second row below each image of the detection result part shows the detection probabilities. The side length of the rectangular box corresponds to the nodule scale. Null indicates a missed detection case.

Fig. 12. Comparisons of FROC curves produced by the presented MK-3DCNN and the contrasted methods on the clinical dataset CQUCH-LND.

Fig. 13. Visualization of some representative nodule detection results generated by the fine-tuned MK-3DCNN and BaseNet models on the clinical dataset CQUCH-LND. The red and blue rectangular boxes illustrate the position ground truths and detection results in the central slices, respectively. The central slice number and detection probabilities are provided below the images. The side length of the rectangular box corresponds to the nodule size. Null denotes a missed detection case.

Fig. 14. Missed detection and false detection cases produced by the MK-3DCNN on the LUNA16 dataset.

Funding.
National Natural Science Foundation of China (42071302); Innovation Program for Chongqing Overseas Returnees (cx2019144); Graduate Research and Innovation Foundation of Chongqing (CYB21060); Visiting Scholar Foundation of Key Laboratory of Optoelectronic Technology and Systems (Chongqing University), Ministry of Education.

Table 1. Notations and definitions.
nor    The maximum intensity of normalized images
2    γ × γ × γ    The spatial resolution of normalized images
3    h × w × s    Size of cropped input patch

7099 0.7723 0.8356 0.8836 0.9174 0.9384 0.9562 0.8591 0.0846 - increase
sample quantity.To assess the effectiveness of the designed data augmentation strategy in the nodule detection task, a controlled experiment is organized by devising a comparison model MK-3DCNN without DA.The MK-3DCNN without DA has the same network structure as the developed MK-3DCNN model, but data augmentation processing is not used during the model training.Figure