Automated PD-L1 status prediction in lung cancer with multi-modal PET/CT fusion

Programmed death-ligand 1 (PD-L1) expression plays a crucial role in guiding therapeutic interventions such as the use of tyrosine kinase inhibitors (TKIs) and immune checkpoint inhibitors (ICIs) in lung cancer. Conventional determination of PD-L1 status relies on the careful analysis of surgical or biopsy tumor specimens. These specimens are gathered through invasive procedures, carrying a risk of complications as well as challenges in obtaining reliable and representative tissue samples. Using a single-center cohort of 189 patients, our objective was to evaluate various fusion methods that use non-invasive computed tomography (CT) and 18F-FDG positron emission tomography (PET) images as inputs to various deep learning models to automatically predict PD-L1 status in non-small cell lung cancer (NSCLC). We compared three different architectures (ResNet, DenseNet, and EfficientNet) and considered different input data (CT only, PET only, PET/CT early fusion, and PET/CT late fusion without, with partially, and with fully shared weights) to determine the best model performance. Models were assessed using areas under the receiver operating characteristic curves (AUCs) with their 95% confidence intervals (CI). The fusion of PET and CT images as input yielded better performance for PD-L1 classification. The different data fusion schemes systematically outperformed their single-modality counterparts when used as input to the various deep models. Furthermore, early fusion consistently outperformed late fusion, probably as a result of its capacity to capture more complex patterns by merging PET- and CT-derived content at a lower level.
When we looked more closely at the effects of weight sharing in late fusion architectures, we found that while it may improve model stability, it did not consistently lead to better results. This suggests that although weight sharing can be beneficial when modality parameters are similar, the anatomical and metabolic information provided by CT and PET scans is too dissimilar to consistently improve PD-L1 status predictions.


Patient cohorts and data collection
The dataset used in this study consists of 189 NSCLC patients having undergone an initial staging PET/CT scan. All images were acquired on a Biograph mCT40 (Siemens, Erlangen) using a standard half-body PET/CT imaging protocol based on the EANM guidelines 42 after intravenous administration of 2.5-3 MBq/kg of 18F-FDG. A low-dose CT was acquired for attenuation correction and anatomical correlation of PET abnormalities (tube voltage 120 kV, CARE Dose 4D current modulation system, reconstruction using 5 mm slice thickness). PET acquisitions were performed using 3.5 minutes per bed position, and PET images were reconstructed using the OSEM-TrueX-TOF algorithm (3 iterations, 21 subsets, voxel size of 4.073 × 4.073 × 2.072 mm³). In addition to images, the dataset includes patient clinical attributes such as sex, stage, age, and PD-L1 expression levels. Out of the total 189 cases, 105 cases (56%) were assigned as PD-L1 positive, while the remaining 84 cases (44%) were categorized as PD-L1 negative.

Data pre-processing and model training
To guarantee consistency between PET and CT volumes, PET images were resized to match the CT scans for every patient. This resizing procedure applies bilinear interpolation, preserving the original content of the image while achieving dimensional alignment. We preferred to avoid downsampling the CT images to the PET resolution based on the EANM/SNMMI recommendations 43, where it is clearly mentioned that downsampling may lead to aliasing artifacts and may require the application of a low-pass filter prior to resampling. At the same time, they noted that there are no clear indications as to whether upsampling or downsampling schemes are preferable. The alternative would have been to resample both the PET and CT images to a common isotropic voxel size of, for example, 2 × 2 × 2 mm³, but this configuration was not tested in this work. In order to facilitate targeted whole lung volume analysis, we developed a whole lung segmentation model. To accomplish this, we used the U-Net model 44 trained on the publicly accessible LIDC-IDRI dataset, which includes 1018 cases 45. Each case of this dataset includes clinical thoracic CT images and an XML file containing the outcomes of a two-phase image annotation process conducted by four experienced thoracic radiologists. The trained U-Net model was subsequently used to segment the whole lung fields in the CT images of our patient datasets.
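As a rough illustration of this resizing step, the PET volume can be resampled onto the CT matrix with linear interpolation. This is only a sketch: the array sizes below are hypothetical, and a clinical pipeline would additionally account for voxel spacing and image origin (e.g. via SimpleITK), which is omitted here.

```python
import numpy as np
from scipy.ndimage import zoom

def resample_pet_to_ct(pet: np.ndarray, ct_shape: tuple) -> np.ndarray:
    """Resample a PET volume onto the CT grid with linear interpolation.

    Illustrative only: real pipelines must also handle voxel spacing,
    origin, and orientation, not just matrix size.
    """
    factors = [t / s for t, s in zip(ct_shape, pet.shape)]
    return zoom(pet, factors, order=1)  # order=1 -> (tri)linear interpolation

pet = np.random.rand(64, 64, 32)   # coarse PET grid (hypothetical size)
ct_shape = (128, 128, 64)          # finer CT grid (hypothetical size)
pet_up = resample_pet_to_ct(pet, ct_shape)
print(pet_up.shape)                # (128, 128, 64)
```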
The segmented lung fields were used to define different image configurations as input to the network model architectures and fusion schemes of the PET and CT images (Fig. 1). The initial setup, denoted Config1, considers a bounding box of 10 pixels in all three dimensions around the segmented whole lung field. This configuration was considered to account for and evaluate the potential impact of any expected segmentation errors while guaranteeing total coverage of the lung region. The second configuration, denoted Config2, utilizes a larger, broadly extended bounding box with an extra margin of 100 pixels, incorporating the lung regions as well as surrounding tissues. This configuration allows assessing the potential impact of avoiding the whole lung segmentation step, using instead most of the original acquired images as input to the network. The last configuration, denoted Config3, concerns the largest lung lesion semi-automatically segmented using the Fuzzy Locally Adaptive Bayesian (FLAB) algorithm 46,47. The obtained volume of interest was adjusted manually by an experienced clinician. The performance of this algorithm has been extensively evaluated for functional tumor volume segmentation, demonstrating high reproducibility and robustness for different cancer models [35][36][37]47. Tumor segmentation on CT images was performed by applying the PET tumor volume of interest to the CT images using 3D Slicer 48. These configurations allow us to comprehensively assess the impact of the tissue/organ types, including tumor only, used as input to the models for PD-L1 prediction.
We implemented our models using PyTorch and MONAI (https://monai.io/). The various models and corresponding configurations were trained on Nvidia A6000, Titan RTX and 2080 GPUs. For our models, we used a standardized image size of 90 × 90 × 90 for each of the six data pipelines and performed noise reduction for each modality, as displayed in Table 1. To reduce overfitting, data augmentations such as horizontal flip, random resized cropping, random rotation, and random color jittering were used throughout the training phase. Each model was trained and evaluated five times, each time involving a different fold as the validation set. This approach allows us to exhaustively assess the performance and generalization of the model on various data subsets.
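The augmentations named above could be sketched in plain NumPy as follows. This is only a toy stand-in: in practice MONAI transforms such as RandFlip, RandRotate and RandShiftIntensity would be composed into the data pipeline, and the rotation here is simplified to 90-degree steps.

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(volume: np.ndarray) -> np.ndarray:
    """Toy versions of the augmentations named in the text
    (flip, rotation, intensity shift); simplified relative to the
    MONAI transforms a real pipeline would use."""
    v = volume
    if rng.random() < 0.5:              # random horizontal flip
        v = np.flip(v, axis=0)
    k = int(rng.integers(0, 4))         # random 90-degree rotation (simplified)
    v = np.rot90(v, k=k, axes=(0, 1))
    v = v + rng.uniform(-0.1, 0.1)      # random intensity shift
    return v

vol = np.zeros((90, 90, 90), dtype=np.float32)  # standardized input size
out = augment(vol)
print(out.shape)                                # (90, 90, 90)
```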
All models were trained for a total of 200 epochs, with a batch size of 1. Different optimizers and learning rates were chosen depending on the model architecture (ResNet, DenseNet-264 and EfficientNet-B1 backbones). To further improve the models' convergence, we used a step decay strategy for adjusting the learning rate, with a decay factor of 20 for the ResNet model and a factor of 10 for DenseNet and EfficientNet. This approach gradually reduces the learning rate during training, allowing large updates during the initial phases of training and finer adjustments later on. The training target was a cross-entropy loss function, combined with several on-the-fly data augmentation methods including rotation, flipping, zooming, and random intensity shift.
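A step decay schedule can be written compactly as a function of the epoch. The decay factors below match those reported in the text; the base learning rate and step size are hypothetical, as they are not specified here.

```python
def step_decay_lr(base_lr: float, epoch: int, step_size: int, factor: float) -> float:
    """Learning rate after step decay: divide by `factor` every `step_size` epochs.

    Decay factors of 20 (ResNet) and 10 (DenseNet / EfficientNet) are the ones
    reported in the text; base_lr and step_size here are hypothetical.
    """
    return base_lr / (factor ** (epoch // step_size))

base_lr = 1e-3                  # hypothetical initial learning rate
for epoch in (0, 49, 50, 150):  # hypothetical step size of 50 epochs
    print(epoch, step_decay_lr(base_lr, epoch, step_size=50, factor=10))
```

PyTorch's built-in `torch.optim.lr_scheduler.StepLR` implements the same behavior with `step_size` and `gamma` arguments.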

Network architectures
In this work, we investigate the use of diverse deep learning models and multi-modal fusion techniques for automated prediction of PD-L1 status in lung cancer using both PET and CT imaging modalities. To achieve this goal, we consider three broadly used CNN backbones: ResNet101, DenseNet264 and EfficientNetB1. These models were chosen because they are well-known and widely used in the field of deep learning for various computer vision tasks, including image classification, object detection, and segmentation. They have demonstrated state-of-the-art performance on benchmark datasets and are known for their effectiveness in learning complex patterns and features from images. ResNet101 and DenseNet264 are deep convolutional neural network architectures with a large number of layers. They are capable of capturing complex features and representations from images, which may be beneficial for tasks requiring detailed analysis of medical images such as PET and CT scans. EfficientNetB1, on the other hand, is known for its efficiency in terms of computational resources while maintaining competitive performance.
ResNet101: Residual Network (ResNet) is a CNN architecture built on the concept of residual learning 49. This strategy facilitates the training of deep neural networks by introducing short-cut connections that skip one or more layers.
DenseNet264: Densely connected convolutional network (DenseNet) is another CNN model known for its densely connected layers 50. Each layer is directly connected to all others in a feed-forward fashion, ensuring a high level of feature reuse and a consistent flow of information across the network.
EfficientNetB1: A more recent addition to the family of CNN architectures, EfficientNet is designed for principled scaling, including not only the expansion of layer depth and width but also the scaling of the input resolution 24.
Despite their distinct characteristics, these models follow a similar mode of operation. In each model, information passes through a sequence of convolutional layers, where each layer incrementally extracts higher-level feature representations. After the convolutional layers, a global average pooling layer follows. This pooling layer reduces spatial dimensionality while preserving essential information. Subsequently, a fully connected layer maps the high-dimensional information into likelihood scores for each class. The sigmoid function is then applied to these scores to yield a final classification. This sequential flow of information, from the initial input to the final classification, is a characteristic common to all three models, enabling their interchangeability while keeping the overall architecture fixed. Leveraging these models, we design six distinct data architectures that consider both single- and multi-modal fusion procedures.
We characterize the performance of the different models and associated fusion protocols using the PET and CT modalities. Here, we denote $I_{CT} \in \mathbb{R}^{H \times W \times D}$ and $I_{PET} \in \mathbb{R}^{H \times W \times D}$ the CT and PET images, respectively. Our principal objective is to develop a deep learning model predicting the PD-L1 class, denoted $y \in \{0, 1\}$, from fused PET/CT images. For this purpose, we employ CNN-based models, which process the modalities either independently or jointly, depending on the chosen fusion scheme. In both scenarios, we use models with associated weights $\Theta$ to extract features from the input images, resulting in a high-level feature representation $h_i = f(I_i; \Theta)$ with $i \in \{CT, PET\}$.
In cases adopting a late fusion scheme, we derive two feature vectors, $h_{CT}$ and $h_{PET}$, obtained by processing $I_{CT}$ and $I_{PET}$ independently through their respective network branches. These feature vectors are then concatenated and passed through a fully connected (FC) layer to deliver the final prediction $\hat{y} = g(h_{CT}, h_{PET})$, where $g$ refers to the FC layer, equipped with its own set of weights.
We aim to determine the optimal weights $\Theta$ and the weights of the FC layer that minimize the cross-entropy loss between the predicted class $\hat{y}$ and the ground-truth class $y$ over a dataset of size $N$. This is mathematically formalized as the minimization of the following loss function: $\min_{\Theta, g} \frac{1}{N} \sum_{n=1}^{N} \mathcal{L}(y_n, \hat{y}_n)$, where $\mathcal{L}(y, \hat{y}) = -y \log(\hat{y}) - (1 - y) \log(1 - \hat{y})$ is the binary cross-entropy loss. This problem formulation outlines the way we handle the multi-modal data, highlighting the fusion of PET and CT image modalities for improved PD-L1 status classification.
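The binary cross-entropy loss above can be computed directly, as in this small NumPy sketch (the labels and predictions are made-up values for illustration):

```python
import numpy as np

def bce_loss(y: np.ndarray, y_hat: np.ndarray, eps: float = 1e-12) -> float:
    """Mean binary cross-entropy L(y, y_hat) = -y log(y_hat) - (1-y) log(1-y_hat),
    averaged over the N samples, as minimized in the text."""
    y_hat = np.clip(y_hat, eps, 1.0 - eps)  # guard against log(0)
    return float(np.mean(-y * np.log(y_hat) - (1 - y) * np.log(1 - y_hat)))

y = np.array([1.0, 0.0, 1.0])       # ground-truth PD-L1 labels (illustrative)
y_hat = np.array([0.9, 0.1, 0.8])   # predicted probabilities (illustrative)
print(round(bce_loss(y, y_hat), 4)) # 0.1446
```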
Architecture A
This architecture exclusively relies on CT scans as input (Fig. 2). Following the data pre-processing described in the previous section, the backbone's layers progressively abstract the input into high-level features. A global average pooling layer follows to reduce the spatial dimensions while keeping the most important information. These high-level features are then forwarded to fully connected layers that perform the mapping into scores for each class. A final classification is produced by applying the softmax function to these scores. This represents the complete flow of information within this architecture, from the input CT volume to the final PD-L1 status prediction.

Architecture B
Comparable to the previous architecture, Architecture B serves as a baseline model focusing only on PET images (Fig. 3). This architecture is trained exclusively on PET images, providing an isolated assessment of the individual contribution of PET volumes to PD-L1 status prediction. After the same data preprocessing as in Architecture A, the data are fed into the same designated CNN-based models. By conducting a comparative analysis of the performances of Architectures A and B, we can effectively evaluate and separate the predictive capacities of the PET and CT modalities. This comparative evaluation provides insights into the specific and individual contributions of each modality to the task of PD-L1 status prediction.

Architecture C
The C architecture considers an early fusion approach, combining the PET and CT modalities right at the beginning of the processing pipeline (Fig. 4). The initial stage incorporates separate data pre-processing steps for the PET and CT data. The preprocessed PET and CT volumes are then concatenated, forming a consolidated input to a single CNN model. The C architecture is designed to investigate the advantages of early fusion, aiming to determine whether a jointly learned representation outperforms models that rely on a single modality. By conducting a comparative examination between architecture C and its counterparts A and B, we intend to clarify the impact of early fusion on the predictive capabilities of the classification pipeline.
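Schematically, early fusion amounts to stacking the preprocessed PET and CT volumes as channels of a single input tensor before the one CNN; the shapes below follow the 90 × 90 × 90 standardized size used in this work, and the random arrays are stand-ins for real volumes.

```python
import numpy as np

# Early fusion: PET and CT volumes become two channels of one input tensor,
# processed jointly by a single CNN. Random arrays stand in for real volumes.
ct = np.random.rand(90, 90, 90).astype(np.float32)
pet = np.random.rand(90, 90, 90).astype(np.float32)

fused = np.stack([ct, pet], axis=0)  # shape (2, 90, 90, 90): channels-first
print(fused.shape)                   # (2, 90, 90, 90)
```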

Architecture D
Architecture D implements a late fusion technique, as shown in Fig. 5. After performing data pre-processing for the PET and CT data, the volumetric information from the two modalities is fed independently into separate CNN models. These models are trained to extract and capture pertinent information from the input, eventually producing high-level features. In this pipeline, the models learn modality-specific representations, without inter-modality dependencies. The fusion of the modality-specific deep feature vectors from the PET and CT modalities occurs prior to their presentation to the fully connected layers. This late fusion framework allows the network to explore the specific information from each modality independently. The final step of this architecture applies a sigmoid function to the likelihood scores produced by the fully connected layers to provide the PD-L1 status.
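The late fusion step can be sketched with random stand-ins for the CNN feature vectors $h_{CT}$ and $h_{PET}$; the 128-dimensional feature size and the single FC layer are hypothetical simplifications of the real branches.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + np.exp(-x))

# Late fusion: each modality yields its own feature vector (random stand-ins
# for CNN branch outputs); the vectors are concatenated and mapped to one
# PD-L1 probability by a fully connected layer. Dimensions are illustrative.
h_ct = rng.standard_normal(128)        # h_CT from the CT branch
h_pet = rng.standard_normal(128)       # h_PET from the PET branch

h = np.concatenate([h_ct, h_pet])      # late fusion of the two branches
w = rng.standard_normal(256) * 0.01    # FC layer weights g(.)
b = 0.0
p = sigmoid(h @ w + b)                 # predicted P(PD-L1 positive)
print(0.0 < p < 1.0)                   # True
```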
Architecture E
Architecture E closely resembles Architecture D, with the principal distinction that specific layers in the configuration share the same weights, as displayed in Fig. 6. This strategic approach leverages the existence of shared low-level features present across the two modalities. The shared component comprises the initial segment of the network, responsible for processing features such as edges and textures. As the network goes deeper into the later layers, the branches keep their separate weight sets, allowing them to fully learn and separate higher-level, task-specific features. These high-level features frequently incorporate more complex and unique aspects of the information that are directly connected to the particular task at hand. Once the feature extraction stage is finished, fully connected layers are applied, followed by a sigmoid function that determines the PD-L1 status. This shared-and-separate weighting framework can be applied to various model structures, with variations in the specific shared layers depending on the design and requirements of individual models.
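Partial weight sharing can be illustrated with plain matrices standing in for convolutional blocks: one shared low-level weight matrix feeds two modality-specific heads. All dimensions here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

def relu(x: np.ndarray) -> np.ndarray:
    return np.maximum(x, 0)

# Partial weight sharing (Architecture E, schematically): the first "layer"
# reuses one weight matrix for both modalities, while each branch keeps its
# own higher-level weights. Matrices stand in for conv blocks.
W_shared = rng.standard_normal((64, 32)) * 0.1  # shared low-level weights
W_ct = rng.standard_normal((32, 16)) * 0.1      # CT-specific head
W_pet = rng.standard_normal((32, 16)) * 0.1     # PET-specific head

x_ct, x_pet = rng.standard_normal(64), rng.standard_normal(64)
h_ct = relu(x_ct @ W_shared) @ W_ct             # both branches apply the
h_pet = relu(x_pet @ W_shared) @ W_pet          # same W_shared object
print(h_ct.shape, h_pet.shape)                  # (16,) (16,)
```

Architecture F corresponds to the limiting case where `W_ct` and `W_pet` would also be a single shared matrix.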

Architecture F
Architecture F, outlined in Fig. 7, represents the extreme of weight sharing between the two networks. Compared to the E architecture, which uses constrained weight sharing, this approach adopts a fully shared-weight strategy. The outcome of this architecture is a synchronized feature extraction process for both modalities. In this arrangement, the PET and CT volumes undergo separate data pre-processing steps before entering the models, whose weights are fully shared. These models are responsible for feature extraction, pooling, and the subsequent fully connected layers. Finally, the sigmoid function provides the classification of PD-L1 status. Architecture F aims to build a universal and versatile representation capable of capturing shared traits across the PET and CT modalities. By comparing this architecture with the previously described models in our study, we aim to understand the potential performance benefits brought by complete weight sharing across diverse structures.

Performance evaluation
During the training and validation stage, we used a random selection of samples to ensure diversity in the characteristics of the dataset. In each fold, we used 132 samples for training and 57 for testing, yielding a split of 70% and 30%, respectively. This partitioning of samples between the training and testing sets is repeated across all folds, and the final model performance estimate was determined by averaging the results obtained from each fold.
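The repeated 70/30 splitting described above can be sketched as follows; the random seed is arbitrary.

```python
import random

# Repeated 70/30 splitting over five folds, as described in the text:
# 132 training and 57 test samples out of 189 in each fold.
N, N_TRAIN = 189, 132
indices = list(range(N))

rng = random.Random(42)  # hypothetical seed
folds = []
for fold in range(5):
    rng.shuffle(indices)
    folds.append((sorted(indices[:N_TRAIN]), sorted(indices[N_TRAIN:])))

train_idx, test_idx = folds[0]
print(len(train_idx), len(test_idx))  # 132 57
```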
The area under the curve (AUC) metric with receiver operating characteristic (ROC) curve analysis, sensitivity, and specificity were used to assess the performance of the classifiers. The DeLong method was used to determine the 95% confidence intervals (CI). The median and interquartile range (IQR) with 95% CI were used to describe continuous variables. All statistical tests were two-tailed, with statistical significance set at p < 0.05.
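For reference, the AUC can be computed directly from its rank-statistic definition (the probability that a randomly chosen positive is scored above a randomly chosen negative); the labels and scores below are made up, and the DeLong confidence interval computation is omitted for brevity.

```python
# Rank-based AUC (equivalent to the Mann-Whitney U statistic), illustrating
# the metric reported in the text; DeLong CIs are not computed here.
def auc(labels, scores):
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [1, 1, 0, 0, 1]            # hypothetical PD-L1 ground truth
scores = [0.9, 0.7, 0.4, 0.6, 0.8]  # hypothetical model probabilities
print(auc(labels, scores))          # 1.0: every positive outranks every negative
```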

Ethical standards and informed consent
The study was approved by the local ethics committee of the University Hospital of Poitiers, France. Informed consent was obtained from all patients. All procedures were in accordance with the ethical standards of the institutional research committee and with the 1964 Helsinki Declaration and its later amendments.

Results
The primary focus of this section is the mean AUC, specificity, and sensitivity of the various models across the different architectures, together with their 95% confidence intervals in the testing datasets, providing a comprehensive overview of their performance.
In Config 1, as displayed in Table 2, architecture C of ResNet showed the best performance with an AUC of 0.79 and a confidence interval of (0.77-0.81), which is relatively narrow, indicating high confidence in this result. Architecture B of ResNet also shows good performance with an AUC of 0.75 (CI: 0.72-0.77), comparable to architecture E. These architectures are reliably associated with high AUC values. Table 3 presents the assessment of the various models using the extended whole lung field image configuration (Config 2). Architecture C provides the best performance with an AUC of 0.80 (95% CI: 0.78-0.83). It provides consistently good results across the AUC, specificity, and sensitivity metrics. ResNet architectures D and E display mixed performance, with AUC values of 0.76 and 0.73, respectively. Despite the differences in AUC, the models perform similarly in terms of specificity and sensitivity, with values around 0.72. The DenseNet C architecture demonstrates robust performance with an AUC of 0.81 (95% CI: 0.80-0.85), similar to the performance of the ResNet C architecture. These results show that the C architecture consistently performs well for binary classification. DenseNet's architecture D also performs well with an AUC of 0.79 (95% CI: 0.78-0.83). Like ResNet, DenseNet maintains a balance between specificity and sensitivity. DenseNet architectures E and F yield competitive AUC values of 0.76 and 0.74, respectively. Even though these values are slightly lower than those of the best architectures, they still provide balanced performance for class prediction. EfficientNet's architecture D performs best with an AUC of 0.79 (95% CI: 0.78-0.82).
Table 4 presents the assessment of the different models when using the tumor segmented volumes as sole input. Architecture C yields the best performance with an AUC of 0.83 (95% CI: 0.81-0.86). The C architecture provides reliably good results across the AUC, specificity, and sensitivity metrics. ResNet architectures D and E show mixed performance, with AUC values of 0.77 and 0.74, respectively. Despite the differences in AUC, the models perform similarly in terms of specificity and sensitivity, with values around 0.73. The DenseNet C architecture demonstrates robust performance with the best mean AUC of 0.84 (95% CI: 0.82-0.88), similar to the performance of the ResNet C architecture. These results highlight the consistently good performance of the C architecture for binary classification. Architecture D of DenseNet also performs well with an AUC of 0.80 (95% CI: 0.77-0.81). Similar to ResNet, DenseNet maintains a balance between specificity and sensitivity. DenseNet architectures E and F have competitive AUC values of 0.80 and 0.76, respectively. EfficientNet's architectures D and E perform well with AUCs of 0.83 (95% CI: 0.81-0.83) and 0.77 (95% CI: 0.74-0.80), respectively, despite being slightly lower than the best-performing architectures for class prediction.
Among all the models assessed, Architecture C outperforms the rest of the configurations, highlighting the significance of the architecture in improving the overall performance of the model. In terms of AUC, ResNet and DenseNet typically perform better than EfficientNet, suggesting that they might be better suited for this particular classification task. For both Architecture F of ResNet and Architecture C of DenseNet, narrow CIs indicate results that are more reliable and consistent. In reviewing the performance of the ResNet, DenseNet and EfficientNet models over the different architectures, we observe that early fusion reliably yields better outcomes than late fusion, as shown by the mean AUC results, illustrating better model performance across diverse configurations. Interestingly, among the late fusion variants (D, E, F), it is architecture D, where weights are not shared, that performs better than models E and F that utilize weight sharing. This observation suggests that not sharing weights in the late fusion scheme may be the preferable setup, indicating a degree of independence between the basic features captured by the PET and CT images. All three configurations in terms of the image data volumes used as input (see Fig. 1) illustrate comparable performance across the ResNet, DenseNet, and EfficientNet models, with certain architectures consistently outperforming others. Config 3 shows better performance compared to Configs 1 and 2, likely because this configuration is focused on the specific lesion segmentation. Moreover, the early fusion architecture demonstrates a smaller standard deviation than those seen in the late fusion architectures. This suggests a more consistent model behavior, with less variability between different runs, further supporting the quality and robustness of the early fusion architecture. Taken together, these results highlight the superiority of early fusion over late fusion, the adequacy of not sharing weights during late fusion, and the consistent and dependable performance provided by the early fusion technique. The class activation map (CAM) visualization of the predictive model identified the center of the tumor region as a critical region for PD-L1 status classification (Fig. 8).

Table 3. Results for the different involved architectures and data pipelines using the testing datasets in Config2. Significant values are in bold underline.

Discussion
The principal objective of this study concerns the classification of the PD-L1 status in NSCLC patients utilizing deep learning in combination with multi-modality PET/CT images. A specific secondary objective concerns the evaluation of different configurations, spanning from the DL model choice to the input images used, including late or early fusion of the information provided by functional and anatomical images. This approach differs from the investigations in the current literature, in which deep learning models and input conditions are typically chosen first and subsequently enhanced with clinical or radiomic data. Imaging as a non-invasive method is highly effective in predicting gene expression and molecular levels in many types of tumors 51,52 and may serve as an important surrogate marker in treatment decision-making. The current state of the art in lung cancer immunotherapy involves significant progress, with immunotherapy emerging as a crucial therapeutic approach 53,54. In the case of lung cancer, to the best of our knowledge, there are only limited studies focused on predicting PD-L1 expression, in their vast majority based on the use of PET/CT images 55,56. Chest CT examination is the most routine detection method in the process of lung cancer diagnosis and treatment, as it is non-invasive, convenient, and easy to perform in daily clinical routine. Almost all NSCLC patients undergo multiple CT scans to track the progression of tumor lesions. On the other hand, in recent years, AI technology, especially deep learning, has been widely used for the interpretation of medical images. Deep learning technology has great potential for the detection, diagnosis, and treatment of lung cancer, from detecting lung nodules to identifying benign and malignant nodules in the lungs to further subtyping [57][58][59]. However, challenges persist, especially in assessing genetic tumor characterisation, including PD-L1 expression, through non-invasive methods [60][61][62][63].
A non-invasive, deep learning-based PET/CT multi-modality imaging strategy is proposed in this study to address these difficulties. The experiments include a critical consideration of model performance for a given architecture and corresponding input data. More explicitly, a comparative investigation of state-of-the-art networks such as ResNet, DenseNet, and EfficientNet in this setting represents a novel contribution to the field. The integration of multi-modality imaging and the proposed fusion schemes likewise showcases advancements beyond what has been investigated in past studies, underlining the novelty and significance of the present work.
The ResNet architecture highlighted its suitability for PD-L1 expression prediction, as evidenced by its consistently superior performance. Besides the equivalent performance of the two fusion strategies in terms of AUC, the higher stability shown by architecture C across the various folds demonstrates the superior generalisation capabilities of the early fusion approach. The preference for early fusion over late fusion indicates that integrating information from both PET and CT modalities at an early stage enhances model performance. The expected advantage of early fusion lies in its capacity to provide a joint representation of the PET and CT data at the lower levels of the network, thereby permitting more complex interactions to be captured. The study also explored different fusion scenarios considering the temporal context of combining information recovered from functional and anatomical images, including weight sharing. Our outcomes also show that weight sharing may not be appropriate, considering that different types of information are provided by the two imaging modalities considered (metabolism for the 18F-FDG images and anatomy for the CT images). This allows for a degree of freedom when it comes to capturing fundamental features from PET and CT images, while also highlighting the significance of not sharing weights during late fusion. Optimizing fusion strategies for better PD-L1 expression prediction can benefit from these findings. The relationships between these types of data are complementary, and as such the choice to share weights depends on the particular characteristics of the data.
Considering the results for models based on a single image input, the PET-only architecture shows better performance when using DenseNet, while ResNet performs best with CT-only images. In terms of absolute performance, our results (mean AUC from 0.71 to 0.75 depending on the network used) are similar to those achieved by previous studies using CT images only for PD-L1 prediction. All of the single-image configurations performed consistently worse than the networks using multi-modality images, clearly demonstrating the interest of using both PET and CT images for this specific task.
The choice of the input image bounding box size was also considered with respect to ease of use and computational efficiency. Different input image configurations, namely whole lung segmented volumes, two-times extended whole lung configurations, and tumor segmented volumes, were evaluated. The best results were consistently achieved with the use of the tumor-only volumes, irrespective of the network, fusion scheme or combination of image inputs considered. This clearly demonstrates that, as expected, there is no information concerning the PD-L1 status to be derived from using the whole lung volumes including healthy tissue, which generally led to worse performance. This result suggests that if one chooses to use the whole images as input to the network, there is a need to incorporate an automatic tumor segmentation from both the PET and CT images as a first stage of the network, before proceeding to the extraction of deep features and their early fusion.
The main limitation of our study is the lack of multi-center data, as well as the limited dataset size, which may influence the performance of the late fusion architectures. Future research directions will involve validation of the results of this single-center study with external datasets within a multi-center study. While our findings offer valuable insights into the effectiveness of various architectures and fusion combinations for PD-L1 status classification, further investigation is warranted, particularly focusing on recent fusion methodologies. More specifically, recent studies [27][28][29] have shown that the combination of hand-crafted radiomics features within deep learning models can lead to enhanced results compared to image-only models. Future investigations, with the objective of further improving the performance of deep learning models for PD-L1 prediction in NSCLC patients, will clearly need to include alternative fusion strategies, such as, for example, the fusion of deep learning features with hand-crafted features at early or late stages.

Conclusion
While concatenating the PET and CT image vectors, as proposed in this work, is one way to integrate the two modalities into the model, a two-channel encoding procedure could offer benefits in terms of simplicity and interpretability. Both procedures have their merits and may be more suitable in different settings. The concatenation approach offers flexibility in how the model learns to combine PET and CT data, allowing more complex interactions between the modalities; however, it may also add complexity and additional parameters to the model architecture. The significance of our research lies in the optimization of a deep learning framework for combining PET and CT images in the context of PD-L1 expression prediction in NSCLC. By integrating information from these two imaging modalities, our study shows a significant improvement in the accuracy of PD-L1 expression prediction relative to the use of a single modality. Moreover, the investigation of different fusion schemes for PET and CT image integration within diverse network architectures indicates that early fusion leads to the best performance. In addition, the best results were obtained by focusing the analysis on the primary lung lesion rather than the whole lung field.
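The two integration strategies contrasted above can be sketched in a few lines of PyTorch. This is a minimal illustrative example; the layer sizes and 2D shapes are assumptions, not the actual architectures used in this work.

```python
import torch
import torch.nn as nn

pet = torch.randn(1, 1, 32, 32)  # one-channel PET input
ct = torch.randn(1, 1, 32, 32)   # one-channel CT input

# (a) Two-channel encoding / early fusion: stack the modalities as input
# channels, so the very first convolution already mixes PET and CT content.
early_input = torch.cat([pet, ct], dim=1)             # shape (1, 2, 32, 32)
early_conv = nn.Conv2d(2, 16, kernel_size=3, padding=1)
early_feats = early_conv(early_input)

# (b) Late fusion: encode each modality separately, then concatenate the
# resulting feature vectors before the classification head.
pet_enc = nn.Sequential(nn.Conv2d(1, 8, 3, padding=1),
                        nn.AdaptiveAvgPool2d(1), nn.Flatten())
ct_enc = nn.Sequential(nn.Conv2d(1, 8, 3, padding=1),
                       nn.AdaptiveAvgPool2d(1), nn.Flatten())
late_vector = torch.cat([pet_enc(pet), ct_enc(ct)], dim=1)  # shape (1, 16)
```

In (a), a single encoder sees both modalities from the first layer onward, which is what allows the low-level cross-modality interactions discussed above; in (b), the modalities only interact after each encoder has produced its own feature vector.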
Based on these findings, our study adds significant knowledge to the understanding of optimal techniques for leveraging multimodal imaging information in predictive modeling of PD-L1 status in NSCLC. Since PD-L1 status is also important in other cancer types, the same models may also prove useful for predicting PD-L1 status in other cancers.

Figure 1. ROI: lung segmentation using a UNet model trained on LIDC-IDRI to obtain three different PET/CT image configurations.

Figure 2. Architecture A: using only CT images as an input to the model.

Figure 3. Architecture B: using only PET images as an input to the model.

Figure 4. Architecture C: using both PET and CT images as an input to the model within an early fusion framework.

Figure 5. Architecture D: using both PET and CT images as an input to the model within a late fusion framework and without shared weights representation.

Figure 6. Architecture E: using both PET and CT images as an input to the model within a late fusion framework with partially shared weights representation.

Figure 7. Architecture F: using both PET and CT images as an input to the model within a late fusion framework with fully shared weights representation.

Table 1. Hyper-parameters used for the different pipelines. For the DenseNet and EfficientNet backbones, we used Stochastic Gradient Descent (SGD) with a learning rate (LR) of 0.001. The choice of SGD for these designs stems from its effectiveness in handling complex models with numerous parameters, such as DenseNet and EfficientNet. A learning rate of 0.001 was considered adequate to guarantee smooth, albeit slower, convergence. For the ResNet-101 backbone, we used the Adam optimizer with a learning rate of 0.01. This somewhat higher learning rate compared to the other backbones favors faster convergence, owing to the comparatively lighter ResNet model, in contrast to DenseNet or EfficientNet.
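The per-backbone optimizer settings above translate directly into PyTorch. The sketch below is illustrative only: a toy linear model stands in for the actual DenseNet/EfficientNet/ResNet-101 backbones, and only the optimizer choices and learning rates come from the table.

```python
import torch.nn as nn
import torch.optim as optim

# Toy model standing in for a real backbone (hypothetical placeholder).
model = nn.Linear(10, 1)

# DenseNet / EfficientNet backbones: SGD with LR = 0.001.
sgd = optim.SGD(model.parameters(), lr=0.001)

# ResNet-101 backbone: Adam with LR = 0.01.
adam = optim.Adam(model.parameters(), lr=0.01)
```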

Table 2. Results for the different architectures and data pipelines on the testing datasets in Config1. Significant values are in bold and underlined.

Table 4. Results for the different architectures and data pipelines on the testing datasets in Config3. Significant values are in bold and underlined.