Lung-RetinaNet: Lung Cancer Detection Using a RetinaNet With Multi-Scale Feature Fusion and Context Module

Lung cancer is one of the most serious diseases worldwide, and its timely detection remains challenging. Oncologists rely on blood test results and CT scans to assess a tumor, which is time-consuming and demands considerable human effort. An automated system is therefore needed to recognize lung tumors efficiently and assess their severity in order to reduce mortality. Although various researchers have proposed lung disease detection systems, existing techniques still lack sufficient detection accuracy for early-stage tumors. This study proposes a novel and efficient lung tumor detector based on RetinaNet, namely Lung-RetinaNet. A multi-scale feature fusion module is introduced to aggregate various network layers while increasing the semantic information available at the shallow prediction layer. Moreover, a dilated, lightweight context module combines contextual information with each network stage to improve the features and localize tiny tumors effectively. The proposed methodology attained 99.8% accuracy, 99.3% recall, 99.4% precision, 99.5% F1-score, and 0.989 AUC. We evaluated the proposed model and compared its performance with state-of-the-art DL-based methods. The outcomes show that our technique provides better results than existing methods.


I. INTRODUCTION
Lung cancer is considered one of the most serious illnesses in many countries, and its death rate is 19.35% [1]. Radiologists detect the cancer in multiple ways, including sputum cytology, CT scans, X-rays, and magnetic resonance imaging techniques. During detection, tumors are categorized as malignant or benign. Malignant tumors are cancerous, proliferate, and have irregular shapes and sizes. The survival rate for patients at an advanced stage is much lower than for patients whose cancer is diagnosed early. It has also been assessed that the timely analysis of scans and imaging can be improved by employing various image processing methods [2]. Various research works have been proposed for detecting early-stage cancers using image processing methods. Two main challenges may reduce the hit ratio of manual lung cancer detection. The first is technical and human accessibility, as radiology resources may be inadequate to meet the demand [3]. Second, a significant number of false positive cases result from the first shortcoming. Therefore, high-quality training should be provided to the radiologists interpreting the images. Moreover, the precision of detection and categorization of existing techniques still requires improvement.

The associate editor coordinating the review of this manuscript and approving it for publication was Gustavo Callico.
Recent progress in machine learning (ML) and deep learning (DL) techniques has resulted in a significant shift towards computer-aided detection (CAD) systems for lung cancer detection. Several traditional ML-based techniques in the literature aid lung cancer detection and classification, for example, Support Vector Machine (SVM), Random Forest (RF), and K-Nearest Neighbors (KNN) [4]. These techniques rely on manual feature extraction, after which a classifier is trained on the extracted features. Handling the various features is tedious and requires extra time. Besides this, ML-based models are trained on a small number of samples, which causes a generalization problem. Some techniques use segmentation methods for lung cancer diagnosis [5]. The region of interest (ROI) is selected from the original image as a segmented region based on texture, color, or grayscale. The standard segmentation techniques include region growing, atlas-based methods, and thresholding. The performance of segmentation-based techniques depends heavily on the segmented area and its extracted features. Many results have been attained using segmentation-based methods for lung cancer detection. However, these techniques still fail on unseen samples and require modification to reduce the false detection ratio.
With the development of deep learning, DL-based techniques have shown superior results for the recognition of numerous diseases, such as those of the knee [6], eyes [6], potato leaves [7], and brain [8], as well as in other computer vision tasks [9]. The primary benefit of DL-based models is the automatic feature extraction phase, whether the model is segmentation-based or classification-based. Furthermore, DL-based models extract the most representative features due to the continuously increasing depth of the layers. DL models comprise pooling, batch normalization (BN), convolutional, and fully connected layers. The pooling layers reduce the size of the feature maps and thereby the model's complexity. Various DL-based models have been proposed for lung tumor detection. However, most methods are based on simple classification [10]. These classification techniques consider the whole image for feature extraction, which may increase the misdiagnosis of early-stage tumors. Therefore, to solve the issues mentioned above and enhance early lung tumor detection performance, we propose a novel and improved deep learning model based on RetinaNet [11]. We use a feature fusion block instead of the feature pyramid network (FPN) in RetinaNet to mine the most representative feature maps while minimizing the loss of detailed information from the input. Moreover, a dilated convolution exploits the most powerful features of tiny tumors at shallow layers. The features from upper layers, which offer high classification accuracy, are integrated with those from the bottom layers, which offer high localization accuracy. A contextual block adds contextual information from the lower layers during feature fusion. Furthermore, the default anchors did not perform well for tiny lung tumors due to their irregular shapes and sizes. Therefore, a k-means clustering method is adopted, as used in YOLO-v3 [12], to generate precise anchors alongside the contextual feature fusion blocks.
The outcomes show that our suggested model efficiently detects and classifies tiny lung tumors. The main contributions of the suggested work are as follows:
• This research aims to detect and segment lung tumors automatically at an early stage. Therefore, we propose a robust multi-scale feature fusion-based module to aggregate various layers of the network, simultaneously increasing the semantic information from the shallow prediction layer.
• We propose a dilated and lightweight algorithm for the context module to combine contextual information with each network stage layer to improve features and effectively localize the tiny tumors.
• RetinaNet relies on local features for detection and lacks contextual information. Therefore, the improved Lung-RetinaNet combines the dilated context module, which has lateral connections at each network level, with the feature fusion block, enriching the features at each stage. Additionally, the use of adaptive anchors improved the detection of lung tumors.
• The proposed Lung-RetinaNet is evaluated using well-known benchmarks, and experimental outcomes indicate that our proposed model significantly outperforms existing lung tumor detectors.
The remainder of the paper is organized as follows: Section II presents the related work, Section III describes the proposed model, Section IV evaluates the proposed model through various experiments, and Section V concludes the work.

II. LITERATURE REVIEW
Deep learning models have been used widely for cancer detection in the last few decades [13], [14], [15], [16]. The key attributes for lung cancer detection are pulmonary nodules and solid clumps of tissue surrounding the lungs [17]. The nodules present in the lungs can be malignant or benign depending upon the nature of the cells, and they are visible on CT scans [18]. In 2015, Hua et al. [19] employed CNN and DBN methods to classify lung nodules using CT scan imaging. They showed that deep learning techniques help extract lung nodule features and categorize them as malignant or benign without considering texture features. Kurniawan et al. [20] and Rao et al. [21] utilized CNNs with a simple and straightforward classification scheme to distinguish lung tumors using CT scans. Song et al. [18] performed a comparative analysis of a deep learning model, a CNN, and an auto-encoder for lung tumor detection. They reported that the classification-based CNN performs better than the DL and stacked auto-encoder models. Ciompi et al. [22] employed multi-scale CNNs to classify lung nodules into six types: calcified, solid, part-solid, non-solid, spiculated, and perifissural nodules. They specifically developed a multi-stream artificial neural network that can handle multiple triplets of 2-dimensional views of lung nodules at different scales and then computes the likelihood of each of the six classes. Yu et al. [23] developed a system performing bone exclusion and lung segmentation before training the CNN. Shakeel et al. [24] proposed a segmentation-based system using pre-processed images. In pre-processing, the authors utilized denoising methods and enhanced the image's appearance. In the end, an artificial neural network was trained to categorize lung cancer.

VOLUME 11, 2023
Ardila et al. [25] proposed a system based on four modules: segmentation, cancer detection model, cancer risk estimation model, and full-volume model. After the lung segmentation, the ROI identification model estimates the maximum nodule region, whereas the full-volume module was trained to estimate cancer likelihood. The outputs of these two modules are combined to form the final prediction. Chen et al. [26] established a system for nodule recognition based on nodule segmentation and enhancement. Hosny et al. [27] and Xu et al. [28] employed CNNs for lung cancer detection after data augmentation. They utilized various augmentation methods, such as flipping, rotation, and translation. A DL-based technique using temporal attention, namely visual simple temporal attention (ViSTA), was applied to CT images in [29]. 351 nodes were utilized in the work, and the model attained an area under the curve (AUC) of 86.4%. The authors in [30] utilized the LUNA16 dataset for training the cancer detection model and then modified the model using the KDSB17 dataset for global feature extraction. After that, they combined local features attained from a separate classifier and achieved higher accuracy for lung nodule detection. In [31], the authors employed transfer learning to train the model multiple times. The classifier was first trained on the ImageNet dataset and then trained using the ChestX-ray14 dataset. In the end, the JSRT dataset was used to test lung cancer detection.

In [32], a survey was performed on lung cancer detection using histopathology imaging. Squamous cell carcinoma (LUSC) and adenocarcinoma (LUAD) are common types of lung cancer, and their detection requires a very experienced pathologist. In that study, a CNN was trained on histopathology slide images to accurately classify LUSC, LUAD, and normal tissues. Xu et al. [33] utilized a CNN combined with an LSTM for lesion detection in X-ray imaging. The LSTM is a modified RNN whose architecture reduces the vanishing gradient problem. The proposed network exploited the relationships among lesions to predict cancer precisely.
Abd El-Wahab et al. [34] developed a system using several versions of EfficientNet (B0, B1, B2) to identify various lung diseases. The authors fused the features of the pre-trained models and then passed them through stacked ensemble learning. The final classifier consisted of an SVM and a Random Forest (RF) in the first stage and logistic regression in the second stage. The maximum accuracy attained by the system was 99% for the detection of TB in the lungs. Aswathy et al. [35] employed a system for lung malignancy detection that takes a nano-image as input and processes it through a Gabor filter and color-based histogram equalization for enhancement. The segmented image of lung cancer was then obtained using the Guaranteed Convergence Particle Swarm Optimization (GCPSO) algorithm. To classify the tumor region, a graphical user interface nano-measuring tool was developed. Additionally, the Bag of Visual Words (BoVW) method and a Convolutional Recurrent Neural Network (CRNN) were utilized for feature extraction and image classification. The authors attained 99.35% accuracy for cancer detection. In [36], two benchmark datasets containing attribute information from multiple patients' health records are used. Principal Component Analysis (PCA) and t-Distributed Stochastic Neighbor Embedding (t-SNE) are employed as standard techniques for feature extraction. In addition, deep features are obtained from the pooling layer of a Convolutional Neural Network (CNN). To select the most relevant features, the Best Fitness-based Squirrel Search Algorithm (BF-SSA) is utilized, which is considered an optimal feature selection method. They attained an accuracy of 93.15%.

III. METHODOLOGY
In this section, we describe the proposed methodology. There are two types of network-based detectors: single-stage and two-stage. Two-stage detectors, including RCNN, Faster RCNN, and Mask RCNN, are based on a region proposal network that finds ROIs. These regions are then used for classification and regression. Such models maintain high precision and accuracy; however, they are very slow due to their architecture. Single-stage detectors, such as the single-shot multi-box detector, are fast; however, they may face the issue of class imbalance during the training phase. Our work employs a single-stage detector for lung tumor detection, and the focal loss helps the network overcome the problem of class imbalance. The basic flow diagram of the proposed system is shown in Figure 1.
The architecture of the improved RetinaNet is shown in Figure 2. We have introduced a context aggregation module instead of the FPN in the original RetinaNet. Our proposed model comprises a ResNet101 backbone network for extracting features from the input, along with dilated contextual block-based fusion and classification and regression sub-networks. The main aim of these sub-networks is to localize and classify lesions of various sizes at varying scales. In Figure 2, the FPN is replaced with our proposed context module. First, a 1 × 1 convolution is applied to the final three layers to reduce their dimensions. Further, L4 and L5 are up-sampled using the bilinear approach to convert them to the same size as L3. A ReLU activation is employed, followed by BN, in the fusion module. Then, our dilated context block is applied to L3_minimized, L4_minimized, and L5_minimized. In the end, our proposed feature fusion and dilated context block are merged for each backbone stage in a bottom-up manner. The layers p6 and p7 are used for large tumors.
RetinaNet is an object detection model that specifically addresses the challenge of class imbalance in object detection tasks. This issue arises because the majority of regions in an image do not contain objects, resulting in a substantial disparity between the positive and negative classes. The Huber loss is used to compute the localization loss of the model, which is the loss incurred when the model predicts the bounding boxes of the objects in the image. The Huber loss is resistant to outliers and reduces the model's sensitivity to small differences between the predicted and ground-truth bounding boxes.

A. FEATURE FUSION BLOCK
The feature fusion block enhances the semantic knowledge of bottom-layer feature vectors [37]. Hence, it improves the localization of tiny tumors. The ReLU activation function introduces non-linearity between the layers, and batch normalization (BN) is employed to avoid the vanishing gradient problem, which also improves performance. As in the RetinaNet detector, the use of ResNet-101 as the base network further helps the model extract the most powerful features [38]. In Figure 2, layer modules L1, L2, L3, L4, and L5 represent the successive scales of the backbone network. The layers used by the FPN range from L3 to L5, where L3 is the shallowest. The shallow layer plays a vital role in the detection of small lesions. Nonetheless, it cannot extract the semantic features present at the deepest layer, L5. Following the aggregation block of the Feature Fusion Single Shot Multibox Detector (FSSD), a 1 × 1 conv. layer is applied to each layer from L3 to L5 to generate L3_minimized, L4_minimized, and L5_minimized. The aim of the 1 × 1 conv. layer is to reduce the dimensions of these layers. Then, L4_minimized and L5_minimized are enlarged through bilinear up-sampling to convert them to the same dimensions as L3_minimized. These minimized layers are then combined at once by concatenation [39]. The concatenated layer is transformed using a dilated context block, which is explained later. In the end, BN and the ReLU function are applied to the context-concatenated layer to form the bottom estimation layer p3.
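The data flow through the fusion block can be traced with simple shape arithmetic. The following is a minimal sketch rather than the authors' implementation: it assumes the usual ResNet-101 strides (8, 16, 32) and channel widths (512, 1024, 2048) for L3 to L5, and a hypothetical width of 256 channels after each 1 × 1 convolution. None of these values is stated in this subsection.

```python
from math import ceil

# Assumed backbone layout (standard ResNet-101); not stated in the text.
STRIDES = {"L3": 8, "L4": 16, "L5": 32}
CHANNELS = {"L3": 512, "L4": 1024, "L5": 2048}
REDUCED = 256  # hypothetical channel width after the 1 x 1 convolution


def fusion_shapes(height, width):
    """Trace (channels, H, W) through reduction, up-sampling and concatenation."""
    shapes = {}
    for name, stride in STRIDES.items():
        h, w = ceil(height / stride), ceil(width / stride)
        shapes[name] = (CHANNELS[name], h, w)
        # A 1 x 1 convolution changes the channel count only.
        shapes[name + "_minimized"] = (REDUCED, h, w)
    _, h3, w3 = shapes["L3_minimized"]
    # Bilinear up-sampling brings L4 and L5 to the spatial size of L3.
    for name in ("L4", "L5"):
        shapes[name + "_upsampled"] = (REDUCED, h3, w3)
    # Concatenation stacks the three branches along the channel axis.
    shapes["concat"] = (3 * REDUCED, h3, w3)
    return shapes


if __name__ == "__main__":
    for key, value in fusion_shapes(512, 512).items():
        print(key, value)
```

Under these assumptions, the three branches share the spatial resolution of L3 before concatenation, which is why the shallow prediction layer can receive semantic information from L5 without losing resolution.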

B. DILATED CONTEXT BLOCK
In this section, dilated blocks, as introduced in DetNet, and an aggregation CN [40] are employed to enrich the information about surrounding tissue, which is necessary for tiny lesions. A block is proposed to extract more features, gather more data from surrounding pixels, and improve detection, as shown in Figure 3. The block is placed before the lateral connections. It comprises two branches: a 1 × 1 convolutional layer for dimension reduction and a 3 × 3 convolutional layer with a dilation rate of 2. These two branches are combined using a feature map concatenation operator [41]. Then, a 3 × 3 convolutional layer is applied to the concatenated feature map.
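To illustrate why a dilation rate of 2 enlarges the context seen by a 3 × 3 kernel, the sketch below computes the effective kernel size and applies a single-channel dilated convolution in pure Python. The function names and the zero-padding choice are assumptions made for illustration; the actual block operates on multi-channel feature maps.

```python
def effective_kernel(kernel, dilation):
    """Side length of the window covered by a dilated kernel."""
    return kernel + (kernel - 1) * (dilation - 1)


def dilated_conv2d(image, weights, dilation=2):
    """Single-channel 2-D cross-correlation with zero padding.

    `image` and `weights` are lists of lists, and `weights` must be square
    with an odd side length. The output has the same size as `image`.
    """
    k = len(weights)
    pad = dilation * (k // 2)
    rows, cols = len(image), len(image[0])
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            total = 0.0
            for i in range(k):
                for j in range(k):
                    # Kernel taps are spaced `dilation` pixels apart.
                    rr = r + i * dilation - pad
                    cc = c + j * dilation - pad
                    if 0 <= rr < rows and 0 <= cc < cols:
                        total += weights[i][j] * image[rr][cc]
            out[r][c] = total
    return out


if __name__ == "__main__":
    print(effective_kernel(3, 2))  # window side covered by a 3 x 3 kernel at rate 2
```

With a dilation rate of 2, the nine weights of a 3 × 3 kernel are spread over a 5 × 5 window, so surrounding tissue contributes to each output pixel at no extra parameter cost.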

C. LATERAL CONNECTION
These connections improve the features and compensate for the loss of information due to down-sampling. Additionally, the dense architecture enhances stability during network training and improves the lesion detection precision. As Figure 3 shows, these connections form two estimation layers, p4 and p5. The reduced feature map L4 is passed through the dilated context block and then connected laterally with the down-sampled concatenated layer through a fusion technique to form estimation layer p4. A similar process is applied to the reduced layer L5 to generate estimation layer p5. Estimation layers p6 and p7 are kept without any modification, and their estimations are used to improve the detection of large lesions. Each estimation layer in the contextual bottom-up lateral-connection fusion block has 256 feature channels.

D. ANCHORS AND SUB-NETWORKS
In RetinaNet, anchors are bounding boxes of different scales and aspect ratios that are placed at various positions on the image. These pre-defined anchors serve as reference points during training to predict the sizes and locations of objects in the image. The use of anchors allows the network to handle objects of different sizes, shapes, and partial occlusions.
In Lung-RetinaNet, the anchor mechanism is employed without any changes. Anchor sizes range from 32 × 32 to 512 × 512 and are assigned to the multi-scale estimation layers p3 to p7, respectively. In total, nine anchors are included in each layer, formed from three box ratios (1:1, 1:2, and 2:1) and three scale multipliers ($2^{0}$, $2^{1/3}$, and $2^{2/3}$). Anchors with a Jaccard overlap below 0.5 are not considered tumors; they are treated as normal lung tissue. RetinaNet has shown that increasing the number of anchors above nine does not improve the performance. Therefore, we utilized 6-9 clusters in the k-means clustering technique for anchor generation on the lung tumor dataset.
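The anchor set described above, and a k-means procedure for adapting anchor shapes to a dataset, can be sketched as follows. This is an illustrative sketch under stated assumptions: the ratios are interpreted as height to width, the distance 1 - IoU follows the YOLO-style clustering mentioned in the introduction, and the deterministic initialization and all function names are choices made here, not details taken from the paper.

```python
from math import sqrt


def default_anchor_shapes(base_size):
    """Nine (width, height) pairs: 3 aspect ratios x 3 scale multipliers."""
    ratios = (1.0, 0.5, 2.0)  # assumed to mean height / width
    scales = (2 ** 0, 2 ** (1 / 3), 2 ** (2 / 3))
    shapes = []
    for ratio in ratios:
        for scale in scales:
            area = (base_size * scale) ** 2
            width = sqrt(area / ratio)
            shapes.append((width, width * ratio))
    return shapes


def shape_iou(a, b):
    """Jaccard overlap of two (w, h) boxes aligned at a common corner."""
    inter = min(a[0], b[0]) * min(a[1], b[1])
    return inter / (a[0] * a[1] + b[0] * b[1] - inter)


def kmeans_anchors(boxes, k, iterations=50):
    """Cluster (w, h) pairs using 1 - IoU as the distance."""
    ordered = sorted(boxes, key=lambda b: b[0] * b[1])
    # Deterministic start: k boxes spread evenly over the area ranking.
    step = (len(ordered) - 1) / max(k - 1, 1)
    centroids = [ordered[round(i * step)] for i in range(k)]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for box in boxes:
            best = max(range(k), key=lambda i: shape_iou(box, centroids[i]))
            groups[best].append(box)
        updated = []
        for group, old in zip(groups, centroids):
            if group:
                updated.append((sum(b[0] for b in group) / len(group),
                                sum(b[1] for b in group) / len(group)))
            else:
                updated.append(old)  # keep the centroid of an empty cluster
        if updated == centroids:
            break
        centroids = updated
    return sorted(centroids, key=lambda b: b[0] * b[1])
```

In practice, `boxes` would hold the widths and heights of the annotated tumor boxes, and `k` would be varied between 6 and 9 as described in the text.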
Lung-RetinaNet is based on an improved RetinaNet structure, an object detection technique that here uses a context aggregation module together with the Huber loss and the focal loss for training. Two sub-networks operate alongside the backbone network, which performs feature extraction. The sub-networks in RetinaNet are applied to the features from different levels of the feature pyramid, which is a hierarchical representation of the image capturing features at various scales.
One of them is the classification sub-network, which predicts the class, whereas the other is the regression sub-network, which generates the bounding box (bbox). Each sub-network comprises four 3 × 3 convolutional layers with 256 channels each, and the ReLU activation function is applied to each layer's output. Classification is then performed by an additional 3 × 3 conv. layer with sigmoid activation, which scores K object classes for each of the A anchors per spatial location. Regression is performed by a 3 × 3 conv. layer that outputs four values per anchor, representing the offsets between the estimated bbox and the target bbox at each spatial location. We performed various experiments using different neural network models with RetinaNet, such as VGG16, DenseNet201, EfficientNet82, and MobileNetv2. The results are reported in the experimental section.
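The output sizes of the two sub-networks follow directly from the number of classes and anchors. The sketch below is a hedged illustration: the strides of p3 to p7 (8 to 128) are the standard RetinaNet values and are assumed here, and the two-class setting (benign, malignant) in the usage lines is an assumption for the example.

```python
from math import ceil

# Strides of the estimation layers p3..p7 (standard RetinaNet values, assumed).
LEVEL_STRIDES = {"p3": 8, "p4": 16, "p5": 32, "p6": 64, "p7": 128}


def head_output_shapes(height, width, num_classes, anchors_per_location=9):
    """Per-level output shapes (channels, H, W) of the two sub-networks."""
    shapes = {}
    for level, stride in LEVEL_STRIDES.items():
        h, w = ceil(height / stride), ceil(width / stride)
        shapes[level] = {
            # K class scores for each of the A anchors.
            "classification": (num_classes * anchors_per_location, h, w),
            # Four box offsets for each of the A anchors.
            "regression": (4 * anchors_per_location, h, w),
        }
    return shapes


def total_anchors(height, width, anchors_per_location=9):
    """Number of anchor boxes scored for one image across all levels."""
    shapes = head_output_shapes(height, width, 1, anchors_per_location)
    return sum(s["regression"][1] * s["regression"][2] * anchors_per_location
               for s in shapes.values())


if __name__ == "__main__":
    print(head_output_shapes(512, 512, num_classes=2)["p3"])
    print(total_anchors(512, 512))
```

Because only a handful of these anchors overlap a tumor, almost all of them are negatives, which is the class imbalance that the focal loss is meant to address.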

1) FOCAL LOSS
For classification, RetinaNet uses the focal loss, which builds on the cross-entropy loss shown in Equation 1. The focal loss is a weighted cross-entropy that addresses the issue of class imbalance: it down-weights easy training examples during the training phase and focuses on the difficult ones.

$$\mathrm{CE}(i, j) = \begin{cases} -\log(i), & j = 1 \\ -\log(1 - i), & j = -1 \end{cases} \tag{1}$$

Here, CE refers to the cross-entropy, $j \in \{\pm 1\}$ represents the target, and $i \in [0, 1]$ is the estimated likelihood of the class with label $j = 1$.

$$\mathrm{FL}(i_T) = -\alpha \, (1 - i_T)^{\gamma} \log(i_T) \tag{2}$$

where $i_T$ is the predicted probability of the correct class label, equal to $i$ for $j = 1$ and to $1 - i$ for $j = -1$. $(1 - i_T)^{\gamma}$ is the modulating factor applied to the CE, with $\gamma \in [0, 5]$ adjustable to reduce the contribution of easy examples, and $\alpha$ is a weighting hyper-parameter. $\alpha$ and $\gamma$ are set to 0.25 and 2, respectively, for the best performance [42].
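The behavior of the focal loss is easy to verify numerically. The sketch below follows the description above with a constant weighting factor, named `alpha` here; the function names are illustrative, and a per-class weighting of the positive and negative classes, as used in some implementations, is deliberately not modeled.

```python
import math


def cross_entropy(i, j):
    """Binary cross-entropy; i is the probability of class j = 1, j is +1 or -1."""
    return -math.log(i) if j == 1 else -math.log(1.0 - i)


def focal_loss(i, j, alpha=0.25, gamma=2.0):
    """Cross-entropy scaled by alpha and the modulating factor (1 - i_t) ** gamma."""
    i_t = i if j == 1 else 1.0 - i  # probability assigned to the correct class
    return -alpha * (1.0 - i_t) ** gamma * math.log(i_t)


if __name__ == "__main__":
    for prob in (0.9, 0.5, 0.1):
        print(prob, cross_entropy(prob, 1), focal_loss(prob, 1))
```

With `gamma = 0` and `alpha = 1` the function reduces to the plain cross-entropy, and for larger `gamma` a well-classified example contributes far less to the total loss than a misclassified one.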

2) HUBER LOSS
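RetinaNet uses the Huber loss function for bounding box regression, computed as shown in Equation 3:

$$L_{\delta}(d) = \begin{cases} \frac{1}{2} d^{2}, & |d| \le \delta \\ \delta \left( |d| - \frac{1}{2} \delta \right), & \text{otherwise} \end{cases} \tag{3}$$

where $\delta$ is an adjustable hyper-parameter that is set to 1, and $d$ is the distance between the two vectors.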
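As a quick numerical check, the standard Huber loss can be written in a few lines. The threshold is named `delta` in this sketch, which is a naming assumption; the text only states that the hyper-parameter is set to 1. Summing the loss over the four box offsets is likewise an illustrative choice.

```python
def huber_loss(d, delta=1.0):
    """Quadratic for small residuals, linear for large ones."""
    a = abs(d)
    if a <= delta:
        return 0.5 * a * a
    return delta * (a - 0.5 * delta)


def box_regression_loss(predicted, target, delta=1.0):
    """Sum of Huber losses over the four box offsets."""
    return sum(huber_loss(p - t, delta) for p, t in zip(predicted, target))


if __name__ == "__main__":
    print(huber_loss(0.5), huber_loss(2.0))
    print(box_regression_loss([0.5, 2.0, 0.0, -1.0], [0.0, 0.0, 0.0, 0.0]))
```

The linear branch is what limits the influence of outliers: a residual of 2 costs 1.5 rather than the 2.0 that a purely quadratic loss would assign.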

IV. EXPERIMENTAL EVALUATION
In this section, the suggested method is evaluated through various experiments, with a focus on metrics and environmental setup for performance evaluation.

A. TRAINING PARAMETERS
For training our proposed Lung-RetinaNet, we used both the RetinaNet and fused models, which employ similar parameters. The number of parameters is almost the same for the original RetinaNet and Lung-RetinaNet: 41M with ResNet50 and 64M with ResNet101. The comparative graph for both models is shown in Figure 4; see also Table 1. The model achieves 98.3% mAP on the dataset.

B. DATASET
In the proposed study, we used two datasets: the Lung Image Database Consortium (LIDC-IDRI) [43] and 50 recorded CT scans of lungs from the Simba lung database [44]. We trained and tested our model using the LIDC-IDRI dataset. For cross-validation, we employed CT scan samples from the Simba database. In the LIDC-IDRI challenge, all CT scans were obtained from the archive of The University of Chicago. All the images were from unique patients and were acquired on Philips Brilliance scanners with a slice thickness of 1.15 mm and the "D" extra-enhancing convolution kernel. The dataset included 1024 patients' CT images of lung nodules along with an XML file presenting the results. The nodules identified by the radiologists were categorized as benign or malignant in the dataset. There were 1018 scans in total, from which we used 750 samples to train our proposed lung cancer detection system. Of these, 450 samples belonged to the malignant class, with tumors of various sizes, and the remaining 300 belonged to the benign class.
The second lung database comprises 50 CT images for the recognition of lung tumors. The CT scans were taken with a slice thickness of 1.25 mm during a single breath. The radiologist provided the locations of the nodules in the dataset. Various test sample images are shown in Figure 5.

C. METRICS
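To assess the suggested system, we employed several metrics: precision, accuracy, F1 score, recall, and area under the curve (AUC). With TP, TN, FP, and FN denoting the numbers of true positives, true negatives, false positives, and false negatives, the standard definitions are:

$$\mathrm{Precision} = \frac{TP}{TP + FP}$$

$$\mathrm{Recall} = \frac{TP}{TP + FN}$$

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

$$\mathrm{F1} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$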
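These metrics can be computed directly from the four confusion-matrix counts. The sketch below uses the standard definitions; the counts in the usage lines are hypothetical values chosen for illustration and are not results from this study.

```python
def classification_metrics(tp, fp, tn, fn):
    """Precision, recall, accuracy and F1 score from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"precision": precision, "recall": recall,
            "accuracy": accuracy, "f1": f1}


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    print(classification_metrics(tp=8, fp=2, tn=9, fn=1))
```

The F1 score is the harmonic mean of precision and recall, so it stays low unless both are high, which makes it a useful summary when the malignant and benign classes are unevenly represented.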
The area under the curve (AUC) is reported alongside these metrics.

Table 2 displays the details of the system used to conduct the experiments, which involved a GeForce GTX GPU card with 4 GB of memory.

E. LOCALIZATION OF LUNG TUMOR
In this unit, the performance of our fusion-based RetinaNet is assessed using four metrics: DOI, TC, number of pixels, and area. Two metrics frequently used in image segmentation are Texture Complexity (TC) and Degree of Interest (DOI). Texture Complexity measures the variety and intricacy of patterns within an image, while Degree of Interest evaluates the significance of different areas in the image with regard to the intended purpose. Our lung cancer detector's performance was evaluated by comparing the results of 50 images from each dataset, with the TC varying between 0 and 1. The metrics were computed using the respective ground truth images, which were assessed by an expert pathologist for the LIDC-IDRI dataset. Table 3 reports the results of 10 images from the LIDC-IDRI dataset, showing that our proposed model achieved excellent results in terms of tumor localization.

We also performed a second experiment to assess the performance of our proposed Lung-RetinaNet using the Simba dataset. We compared our proposed model with various convolutional neural networks as the backbone network of RetinaNet using DOI and TC. We used DenseNet201, VGG16, EfficientNet82, ResNet101, and MobileNetv2. The results are shown in Table 4. We noticed that EfficientNet82 and MobileNetv2 did not perform well. The poor performance could be due to the varying feature maps among blocks. VGG16 and DenseNet201 performed better than MobileNetv2 and EfficientNet82. Although VGG16 performed better, its inference time was relatively high. DenseNet201 performed better due to the direct connections among layers for feature extraction. The best results were attained by our fused model, and the second-highest results were achieved using ResNet101. The reason could be that ResNet101 uses residual blocks that transfer feature maps among layers, which addresses the vanishing gradient issue. Therefore, we have selected our feature fusion-based RetinaNet for lung tumor detection.

F. DETECTION AND CLASSIFICATION OF LUNG TUMOR
This section presents the classification performance of our proposed model on two datasets: LIDC-IDRI and Simba. The proposed model was trained on a total of 750 images and tested on 300 samples, with 150 images from each class: malignant and benign. Table 5 shows that the proposed model achieved significant results in classification. Additionally, we trained three object detectors: Faster RCNN, Mask RCNN, and Lung-RetinaNet, with the latter achieving the best results, including 99.8% accuracy, 99.3% recall, 99.4% precision, 99.5% F1 score, and 0.989 AUC. The next best classification results were obtained using Faster RCNN with 98.3% accuracy, followed by Mask RCNN with 97.3% accuracy. We also compared the inference time of these techniques, and the outcomes show that our suggested system is faster than the others. The reason could be its one-stage detection mechanism and simple layer architecture. Therefore, we believe that our proposed Lung-RetinaNet is an efficient tumor detector.

For the second experiment, we used the Simba dataset to cross-validate our proposed model. We used 50 images from the malignant class and achieved the best results with our proposed Lung-RetinaNet, including 99.3% accuracy, 99.1% recall, 99.2% precision, 99.3% F1 score, and 0.977 AUC. The next best classification results were obtained using Mask RCNN with 99.0% accuracy. The lowest cross-validation accuracy was achieved by the Faster RCNN classifier, with 98.1%. Table 6 provides more details on the cross-validation results. The better performance of Mask RCNN over Faster RCNN could be because it is easy to generalize and train. Moreover, it adds only a little overhead to Faster RCNN, making it more efficient. Furthermore, our proposed Lung-RetinaNet is considerably more efficient than Mask RCNN and Faster RCNN, as it requires the least inference time on both datasets.

G. ABLATION EXPERIMENTS
In this section, various methods for detecting small tumors in lung images will be presented. The Simba dataset was used for ablation study experiments. Initially, the effects of incorporating a feature fusion module, dilated context module, and lateral connections are examined. Afterwards, the impact of adjusting hyper-parameters of the focal loss on accuracy is discussed.
In the first ablation study, we performed an experiment to analyze the contribution of the fusion module. In the original RetinaNet, the fusion module and the dilated context module are not present. We added the lateral connections and applied the focal loss adjustments as in RetinaNet; however, the mAP attained was 85.23%, as shown in Table 7. It is clearly visible that when the fusion module is used alone, without the dilated context module and lateral connections, the detection performance degrades due to information loss and down-sampling, and the model is unable to identify tiny tumors accurately. However, when lateral connections are added along with the fusion module, the detection performance improves. Moreover, for our proposed Lung-RetinaNet, when we added the lateral connections, fusion module, dilated context block, and focal loss adjustments, we achieved remarkable results for tiny tumor detection. The RetinaNet detector uses focal loss as its primary approach to address the class imbalance issue, with the weighting hyper-parameters α and γ set to 0.25 and 2, respectively, resulting in optimal RetinaNet performance. However, for Lung-RetinaNet, this setting is not the most effective, and modifying the focal loss hyper-parameters could enhance Lung-RetinaNet accuracy. Table 9 demonstrates that adjusting α and γ to 0.25 and 2.5, respectively, yields the highest Lung-RetinaNet performance, resulting in a 1.5 point increase in accuracy.
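The effect of raising γ from 2 to 2.5 can be seen in a short worked example of the modulating factor $(1 - i_T)^{\gamma}$. The probabilities 0.9 (an easy example) and 0.1 (a hard example) are illustrative values chosen here and are not taken from the experiments.

```latex
\begin{align*}
\text{easy example, } i_T = 0.9: \quad & (1 - 0.9)^{2} = 0.01, & (1 - 0.9)^{2.5} &\approx 0.0032 \\
\text{hard example, } i_T = 0.1: \quad & (1 - 0.1)^{2} = 0.81, & (1 - 0.1)^{2.5} &\approx 0.768 \\
\text{hard-to-easy ratio:} \quad & \left(\tfrac{0.9}{0.1}\right)^{2} = 81, & \left(\tfrac{0.9}{0.1}\right)^{2.5} &= 243
\end{align*}
```

Under these illustrative values, the larger γ triples the relative weight of the hard example with respect to the easy one, which is one plausible reading of why a slightly larger γ can help when the targets are tiny and the easy background anchors dominate.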

H. COMPARISON WITH EXISTING TECHNIQUES
We compare our proposed lung tumor detector with existing techniques. The outcomes are reported in Table 10. Our suggested detector achieved 99.8% detection accuracy, and the results have been validated by two radiologists. Xie et al. [45] utilized 888 samples from the LUNA16 dataset and developed a system based on two modules. First, they detected the locations in the images using an improved Faster R-CNN. Second, they trained three models for false positive reduction. The resulting system was somewhat complex and achieved 88.17% detection accuracy, which is lower than that of the other methods. Chao et al. [46] developed a CNN-based technique for detecting lung tumor nodules after pre-processing images from the LUNA16 and Kaggle datasets, attaining 92% accuracy. Moreover, Eali et al. [47] provided a multi-view network approach along with weighted gradient activation for the binary classification of lung tumors. Their model was lightweight; however, a tremendous amount of information was lost due to a class invariant problem in the max-pooling layers of their CNN. Nevertheless, they attained a considerable detection accuracy of 97.17%. Comparatively, our proposed Lung-RetinaNet attained 99.8% detection accuracy while retaining the key information of lung CT images. Moreover, our proposed detector is a one-stage design that is easy to use and modify. Therefore, our proposed lung tumor detector and classifier achieved better results than existing techniques in terms of complexity and accuracy.
We selected RetinaNet for lung cancer detection as it offers several benefits over other models. First, it achieves superior accuracy on several object detection benchmarks, even when dealing with severe class imbalance, which its focal loss is designed to address. Second, RetinaNet offers faster inference than traditional two-stage detection models, because its single-stage detection process requires only one forward pass through the network to produce object detections. Third, RetinaNet is easy to train due to its simple architecture and the use of focal loss, which simplifies the learning process by focusing on hard examples. Fourth, RetinaNet is highly effective at detecting small objects due to the use of sub-networks and the context block, which provides multi-scale representations of the input image. Lastly, RetinaNet has a high recall rate, meaning that it can identify most of the objects in the image, even when they are small or partially occluded. Overall, RetinaNet is a powerful model for object detection that provides high accuracy, speed, and recall, making it a popular choice for a wide range of applications.
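As a small illustration of why a dilated context block can supply wider context for tiny objects, the sketch below computes the receptive field of stacked stride-1 convolutions with and without dilation. The 3×3 kernel size and the dilation rates (1, 2, 4) are assumptions chosen for illustration and are not taken from the Lung-RetinaNet configuration.

```python
def effective_kernel(kernel_size, dilation):
    """Span covered by one dilated kernel along one axis."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)

def receptive_field(kernel_size, dilations):
    """Receptive field of stacked stride-1 convolutions with the given dilations."""
    rf = 1
    for d in dilations:
        rf += (kernel_size - 1) * d
    return rf

# Same number of layers and weights, different context size.
for rates in ([1, 1, 1], [1, 2, 4]):
    print(rates, receptive_field(3, rates))
```

With these assumed rates, three ordinary 3×3 layers see 7 pixels along each axis, while the dilated stack sees 15 pixels with the same number of weights and no additional down-sampling.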

I. COMPUTATIONAL COST
In the lung tumor segmentation task, our one-stage RetinaNet detector outperformed previous methods, which included one- and two-stage detectors such as Faster R-CNN, R-CNN, and SSD321. Faster R-CNN had an average precision of 36.4% at the first five scales (400-800 pixels), with an inference time of 192 ms, while R-CNN and SSD321 had average precisions of 35% and 39%, respectively, with inference times of 95 ms and 82 ms. By comparison, our proposed methods, RetinaNet with ResNet-50 and ResNet-101, achieved average precisions of 25% and 31%, respectively, at the same scales, with inference times of 57 ms and 51 ms, respectively, as illustrated in Figure 6.
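Per-image inference times such as those above are typically obtained by timing repeated forward passes. The following is a minimal sketch of such a measurement, not the protocol used in this work; the helper `mean_latency_ms` and the stand-in `fake_detector` are hypothetical names introduced only for illustration.

```python
import statistics
import time

def mean_latency_ms(infer, inputs, warmup=2):
    """Mean wall-clock time per call of infer(x), in milliseconds."""
    # Warm-up calls are excluded so one-time setup costs do not skew the mean.
    for x in inputs[:warmup]:
        infer(x)
    timings = []
    for x in inputs:
        start = time.perf_counter()
        infer(x)
        timings.append((time.perf_counter() - start) * 1000.0)
    return statistics.mean(timings)

def fake_detector(image):
    """Hypothetical stand-in for a detector forward pass."""
    return sum(image)

images = [list(range(10_000)) for _ in range(20)]
print(f"mean latency: {mean_latency_ms(fake_detector, images):.3f} ms")
```

When the model runs on a GPU, the device usually has to be synchronized before each clock read; otherwise asynchronous kernel launches can make the measured time appear shorter than it is.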

V. CONCLUSION
This work proposes a novel and robust lung tumor detection method based on a fused RetinaNet. Our proposed model utilizes CT scans for training and testing. We have introduced a context aggregation module in place of the FPN in the original RetinaNet. Our proposed model comprises a ResNet backbone network for extracting features from the input, along with dilated contextual block-based fusion and the classification and regression sub-networks. Owing to this improved RetinaNet structure, it precisely detects tiny lung tumors. More specifically, the fusion block is down-sampled and combined with a dilated context module at every network level. This connection enhances the capability of the proposed network to extract valuable features and also improves its localization performance. Our proposed methodology achieved 99.8% accuracy, 99.3% recall, 99.4% precision, 99.5% F1-score, and 0.989 AUC. We evaluated our proposed method and compared its results with state-of-the-art DL-based methods, which revealed that our technique outperforms existing systems. Thus, we believe that our proposed system can be utilized by medical experts to identify tumors at early stages.
Our proposed method performed well for lung tumor detection; however, it faced some challenges that need to be addressed in the future. First, the ability to identify tiny tumors is reduced by the limited resolution of the input images. Second, when the images contained a lot of clutter or noise, the system was unable to differentiate background tissue from tiny tumors.
In the future, our objective is to employ our suggested method for the multi-class classification of various cancers, such as skin and bone cancer. Moreover, the training time of the proposed model was considerably high; however, it could be reduced using a multi-GPU-based training system. This would also improve the inference time of the proposed system for clinical deployment. Furthermore, we will add multi-modal information as input to our system, such as combining imaging data with other forms of clinical data (genomic or proteomic data), to improve the accuracy of tumor detection.

DATA AVAILABILITY
The data used for this research are publicly available.