Improved transfer learning using textural features conflation and dynamically fine-tuned layers

Transfer learning involves reusing the knowledge a model has learnt on one task to address another task. However, this process works well only when the tasks are closely related. It is, therefore, important to select data points that are closely relevant to the previous task and to fine-tune suitable layers of the pre-trained model for effective transfer. This work utilises the least divergent textural features of the target datasets and of the pre-trained model's layers, minimising the knowledge lost during the transfer learning process. This study extends previous works on selecting data points with good textural features and on dynamically selecting layers using divergence measures by combining them into one model pipeline. Five pre-trained models are used: ResNet50, DenseNet169, InceptionV3, VGG16 and MobileNetV2, on nine datasets: CIFAR-10, CIFAR-100, MNIST, Fashion-MNIST, Stanford Dogs, Caltech 256, ISIC 2016, ChestX-ray8 and MIT Indoor Scenes. Experimental results show that data points with lower textural feature divergence and layers with more positive weights give better accuracy than other data points and layers. The data points with lower divergence give an average improvement of 3.54% to 6.75%, while the layers improve accuracy by 2.42% to 13.04% for the CIFAR-100 dataset. Combining the two methods gives an extra accuracy improvement of 1.56%. This combined approach shows that data points with lower divergence from the source dataset samples can lead to better adaptation for the target task. The results also demonstrate that selecting layers with more positive weights reduces the trial and error involved in selecting fine-tuning layers for pre-trained models.


INTRODUCTION
Transfer learning has made deep learning easier to implement and adapt in many industries. The process involves reusing knowledge from a previous pre-trained model's task in another domain's task and dataset. Studies suggest this process works well for closely related tasks. For example, a pre-trained model with learnt knowledge of car images is better used to classify lorry images than to classify flowers. This example results in positive transfer learning due to the sharable features of the two domains: cars and lorries. The selected data points can be used in other model improvement tasks such as augmentation. Selecting data points builds confidence in their performance when applied in machine learning algorithms. By selecting layers with higher positive weights, the pre-trained model can be fine-tuned much better, which improves its performance. These positive weights can give faster convergence, improving adaptability to the target tasks. By preparing the modelling pipeline with quality target dataset samples and adaptable fine-tuned layers, pre-trained models can improve performance and user confidence in the target tasks during transfer learning. This approach results in an effective and efficient selection procedure, replacing the previous trial-and-error selection of pre-trained models and fine-tuning layers.

Textural features of image data
In image processing, texture refers to an objective function representing the brightness and intensity variation of an image's pixels (Tuceryan & Jain, 1993). Texture explains images' smoothness, roughness, regularity and coarseness, as noted by Laleh & Shervan (2019). Texture gives the sequential illumination patterns of the pixels in an image and the image grey tones in the pixels' neighbourhood (Dixit & Hegde, 2013). Textural features in an image can be classified into three levels: low-level, mid-level and high-level. The classification is based on pixel levels, image descriptors and image data representation (Bolón-Canedo & Remeseiro, 2019). The features in an image are analysed to understand the spatial arrangement of the pixels' grey tones after their extraction. The extraction process can be categorised according to transformation, structure, model, graphical, statistical, entropy and learning views. The low-level image features are heavily used in image classification, utilising colour, texture and shape attributes. These attributes are passed through filters, quantified using statistical descriptors such as entropy and correlation, and ranked through relevance indices.
The two commonly used methods in textural analysis are the grey-level co-occurrence matrix (GLCM) and the local binary pattern (LBP) (Ershad, 2012). The GLCM was introduced by Haralick, Shanmugam & Dinstein (1973) to represent the pixels' brightness levels using a matrix that combines the grey-level intervals, direction and amplitude change. The GLCM descriptor has 14 features, with interval distance and orientation being the most important (Andrearczyk & Whelan, 2016). This study evaluates three features: correlation, homogeneity and energy. The LBP uses the local textural patterns in an image and compares a pixel's neighbouring grey levels. The comparison of the neighbours uses representative binary numbers described using histograms. The LBP is a robust textural descriptor used in edge detection and textural description (Zeebaree et al., 2020).
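As an illustration of these descriptors, a GLCM and its energy, homogeneity and correlation properties can be computed with plain numpy. The sketch below is not the authors' implementation: the quantisation level, the single offset and the property formulas are standard textbook choices rather than values taken from this study.

```python
import numpy as np

def glcm(image, levels=8, dx=1, dy=0):
    """Grey-level co-occurrence matrix for one pixel offset (illustrative)."""
    # Quantise the 8-bit image into `levels` grey levels.
    q = (image.astype(float) / 256 * levels).astype(int)
    m = np.zeros((levels, levels))
    h, w = q.shape
    for i in range(h - dy):
        for j in range(w - dx):
            m[q[i, j], q[i + dy, j + dx]] += 1
    return m / m.sum()  # normalise to a joint probability distribution

def glcm_properties(p):
    """Energy, homogeneity and correlation of a normalised GLCM."""
    levels = p.shape[0]
    i, j = np.indices((levels, levels))
    energy = np.sum(p ** 2)
    homogeneity = np.sum(p / (1.0 + np.abs(i - j)))
    mu_i, mu_j = np.sum(i * p), np.sum(j * p)
    sd_i = np.sqrt(np.sum((i - mu_i) ** 2 * p))
    sd_j = np.sqrt(np.sum((j - mu_j) ** 2 * p))
    correlation = np.sum((i - mu_i) * (j - mu_j) * p) / (sd_i * sd_j)
    return energy, homogeneity, correlation
```

For real workloads, library implementations such as scikit-image's `graycomatrix`/`graycoprops` would normally be preferred over hand-rolled loops.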

Textural features conflation of image data
Conflation refers to a merge of two or more probability distributions. The concept was introduced by Hill (2011). Given probability distributions $P_1, \ldots, P_n$, their conflation $\&(P_1, \ldots, P_n)$ is expressed in Eq. (1) below, with Eqs. (2) and (3) giving the conflation for discrete distributions. Equation (4) shows the conflation of continuous distributions for textural features in the source and target domain data points.

$$Q = \&(P_1, \ldots, P_n) \tag{1}$$

where $Q$ is the merged probability distribution. For discrete distributions, the conflation is the normalised product of the input distributions:

$$\&(P_1, \ldots, P_n)(x) = \frac{P_1(x) \cdots P_n(x)}{N} \tag{2}$$

$$N = \sum_{y} P_1(y) \cdots P_n(y) \tag{3}$$

For continuous distributions,

$$\&(f_1, \ldots, f_n)(x) = \frac{f_1(x) \cdots f_n(x)}{\int f_1(y) \cdots f_n(y)\, dy} \tag{4}$$

where $f_1, \ldots, f_n$ refer to the probability density functions of the textural features. This equation can be rewritten for the source domain as

$$\&(SP_1, \ldots, SP_n)(x) = \frac{SP_1(x) \cdots SP_n(x)}{\int SP_1(y) \cdots SP_n(y)\, dy}$$

where $SP_1, \ldots, SP_n$ refer to the probability distributions of the source domain samples. The probability distributions of the target domain samples can be represented using Eq. (4) by substituting $S$ with $T$.
The conflation of features removes redundant features, resulting in a balanced probability distribution (Hill, 2011).
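For discrete distributions over a common support, conflation reduces to a normalised pointwise product. A minimal numpy sketch (illustrative, not the study's code):

```python
import numpy as np

def conflate(distributions):
    """Conflation of discrete probability distributions (Hill, 2011):
    the normalised pointwise product of the inputs."""
    q = np.prod(np.asarray(distributions, dtype=float), axis=0)
    total = q.sum()
    if total == 0:
        raise ValueError("distributions have disjoint support")
    return q / total

# e.g. conflate([[0.5, 0.5, 0.0], [0.25, 0.5, 0.25]]) -> [1/3, 2/3, 0]
```

Note how a zero-probability bin survives as zero: values impossible under any input distribution remain impossible after conflation, which is one way redundant probability mass is removed.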

Model layer fine-tuning in transfer learning
Fine-tuning is one of the methods of transfer learning. The process involves selecting training layers and freezing the weights of the pre-trained model on a target task. In most cases, the first layers of a model are chosen due to their ability to extract features, as opposed to the last layers, which are mainly used for classification purposes (Coskun et al., 2017). Fine-tuning has been a manual process involving selecting the first or initial layers (in most cases, the last three layers of the network), as reported in the literature (Deniz et al., 2018; Fan, Lee & Lee, 2021). Vrbančič & Podgorelec (2020) noted that models have specific architectures that sometimes make their layer selection inefficient and fine-tuning a trial-and-error process.

RELATED WORK
This proposed work builds on two previous works by Wanjiku, Nderu & Kimwele (2022). The first work looks into selecting relevant data points using textural features. The authors use three datasets: Caltech 256, Stanford Dogs 120 and MIT Indoor Scenes, on two pre-trained models, VGG16 and MobileNetV2. The proposed approach adds datasets and models, extending the process by dynamically selecting fine-tunable layers in the transfer learning process. The authors use four datasets in the second study: CIFAR-10, CIFAR-100, MNIST, and Fashion-MNIST, on six pre-trained models. In the second study, the authors evaluate the selection of pre-trained models' layers based on weights. They use cosine similarity and later D_KL on the cosine similarity. The proposed work only looks at the D_KL divergence and further utilises the data points previously identified in the processing pipeline. In this study, two smaller datasets are added to validate the model, since one of the use cases of transfer learning is in cases of limited datasets.
The selection of data points in transfer learning has been documented in various literature. In a study by Weifeng & Yizhou (2017), data points with similar low-level features are identified and selected in the target domain to address insufficient data using Gabor filters. The feature selection in this proposed study uses pre-trained CNN layer filters, while Ge & Yu (2017) used Gabor filters. Their work describes the features using histograms, while the proposed study uses conflated probability distributions. The two studies also differ through their datasets and pre-trained models: the researchers used three pre-trained models (AlexNet, VGG-19 and GoogleNet) and three datasets (Caltech 256, MIT Indoor Scenes and Stanford Dogs 120), while the proposed uses five pre-trained models (ResNet50, DenseNet169, VGG-16, InceptionV3 and MobileNetV2) and six additional datasets. Zhuang et al. (2015) compare the features between the source and target domains from a generative adversarial network (GAN) using Kullback-Leibler divergence on the features' probability distributions. The distribution formation utilises a temperature-softmax function, which controls the samples used in the source domain. The Kullback-Leibler divergence and temperature-softmax function are also used in this research. The two studies differ in the pre-trained models and datasets used, where the researchers utilise the ResNet architecture and one dataset. In contrast, the proposed uses four additional architectures and nine datasets to validate the conflation method. Luo et al. (2018) utilise an optimal similarity graph to select low-level features in video semantic recognition. The researchers use semi-supervised learning to address the curse of dimensionality, preventing information loss between video pairs while acquiring the features of the local structure. This approach differs from the proposed method in feature extraction (the proposed uses convolution layers and conflates the features) and dynamic fine-tuning. In contrast, the researchers use an optimal similarity graph. However, the two methods both use divergence measures in comparing the low-level features. Gan, Singh & Joshi (2017) address the conflation of probability distributions, intending to understand the semantics in text strings, utilising a long short-term memory recurrent neural network (LSTM-RNN) in business analytics for entity profiles. The conflation method has also been used in geographic information systems (GIS) in merging geospatial datasets (De Smith, Goodchild & Longley, 2018), in the detection of robotic activities (Rahman et al., 2021), and in feature dimensionality reduction involving large datasets, as researched by Mitra, Saha & Hasanuzzaman (2020).
Royer & Lampert (2020) introduce the flex-tuning method. The method proposes fine-tuning a single layer while freezing the rest of the model. This process is iterated until a group of best unit layers is selected for use in the final transfer learning process. The researchers' approach differs from the proposed method in the layers' selection criteria. The researchers select the weights based on the layer that performs better, while the proposed method selects the layer based on its positive weights. However, both methods consider the weights in the layer selection process. Furthermore, the proposed method integrates the selection of quality data points, aiding overall network performance. Yunhui et al. (2019) introduce the SpotTune method that uses Gumbel-Softmax sampling on two ResNet architectures. In the method, a decision policy determines the best layers to be selected through a lightweight network that evaluates each instance. This approach differs from the proposed approach: the researchers use a lightweight network for layer selection with two ResNets and five datasets, while the proposed uses weights for layer selection, five pre-trained models, and nine datasets.
Other layer selection studies have considered evolutionary algorithms. Satsuki, Shin & Hajime (2020) utilise the genetic algorithm, where genotypes (representing the layer weights) with the highest accuracy are selected for fine-tuning. The researchers improve the algorithm by using the tournament selection algorithm, experimenting with three datasets: SVHN, Food-101 and CIFAR-100. Vrbančič & Podgorelec (2020) introduce the differential evolution algorithm that selects and represents the pre-trained model's layers using binary values. All the selected layers are assigned a binary value of 1.
All these documented layer selection methods have been evaluated on one pre-trained model and one or two datasets, which is insufficient and needs more evaluation. The proposed method, however, uses more models and datasets. Furthermore, the feature extraction in the first phase of the model is done using the convolutional layers of the same pre-trained model to be used in the transfer learning process.
Apart from selecting features, transfer learning is used in various industries, including biomedical, manufacturing, and deep learning model security. In medical imaging, numerous issues affect the data used in deep learning, including legal and ethical issues, which limit the data size and make acquisition expensive. Matsoukas et al. (2022) demonstrate the effectiveness of feature reuse in the early layers and of weight statistics when using transfer learning. The researchers demonstrate that vision transformers (ViTs) benefit from the feature-reuse gain in transfer learning since they lack the inductive bias available in CNNs. The adaptation noted in medical datasets flows from weights learned on ImageNet and the low-level features extracted in the pre-trained models. The researchers use four datasets: APTOS2019, CBIS-DDSM, ISIC 2019, and CHEXPERT, on four models: two ViT models (DEIT and SWIN) and two CNN models (RESNETs and INCEPTION). They conclude that feature reuse plays a critical role in effective transfer learning, with the early layers showing a strong feature-reuse dependence. Their work differs from the proposed method in the number of pre-trained models and datasets. Mabrouk et al. (2022) use transfer learning on three medical datasets, ISIC-2016, PH2, and Blood-Cell, to improve Internet of Medical Things (IoMT) performance in melanoma and leukaemia. Transfer learning extracts the image features, while chaos game optimisation selects the good features. In a skin classification task by Rodrigues et al. (2020), transfer learning is used on skin lesions, typical nevi and melanoma using IoT systems. The researchers use VGG, Inception, ResNets, Inception-ResNet, Xception, MobileNet and NASNet as pre-trained models, applying SVM, Bayes, RF, KNN, and MLP classifiers. The researchers experiment with the method on two datasets: ISIC and PH2. Their classification study aimed to address issues faced by medical teams during lesion classification. These issues include varying lesion sizes and shapes, the patient's skin colour, personnel experience, and fatigue on the classification day. The work by these researchers uses more datasets and pre-trained models but differs from the proposed method in feature conflation and dynamic layer selection. Duggani et al. (2023) develop a hybrid transfer learning model from two pre-trained CNN models to improve melanoma classification. In the study, they predict the fine-grained differences in skin lesions on the skin surface. The features extracted from the two models are concatenated, and an SVM classifier is added at the end of the model. The concatenation improves the accuracy values of the traditional CNN models on the ISBI 2016 dataset. The researchers used AlexNet, GoogleNet, VGG16, VGG19, ResNet18, ResNet50, ResNet101, ShuffleNet, MobileNet, and DenseNet201 as the pre-trained models. Nguyen et al. (2022) have also documented skin lesion classification using Task Agnostic Transfer Learning (TATL). The researchers concatenated the extracted features, while this work conflates them, selecting the ones to evaluate the target task samples.
Transfer learning has been used further in medical imaging to classify other diseases. One 2022 study uses transfer learning to address the technical variability of MRI scanners and the differences in subject populations on the UK Biobank MRI data (three datasets), focusing on age and sex attributes. In all these medical imaging cases, Kim, Cosa-Linan & Santhanam (2022) note that the most common transfer learning models in medical scenarios include AlexNet, ResNet, VGGNet and GoogleNet since they can be easily customised.
In fault-tolerant systems, Li et al. (2020) use transfer learning to address the limited data available in these systems. The researchers use simulation data on a convolutional neural network architecture integrating domain adaptation techniques. The developed model is deployed on a pulp mill plant and a continuously stirred tank reactor. Nawar et al. (2023) use transfer learning to optimise power generation planning and bill savings potential. Their building-to-building transfer learning model uses the deep learning transformer model to forecast power savings. The researchers evaluated the algorithm on a large commercial building against LSTM and RNN models and concluded that the transformer model performed better than the LSTM and RNN architectures. In satellite data applications, researchers tap into massive-dataset-trained foundation models such as ImageNet and GPT-3 to improve the performance of downstream tasks in different satellite application domains.
In their work, Simumba & Tatsubori (2023) use foundation models by allowing pre-trained model weights in cases of varied input channels. Using these weights helps the downstream applications address the differences between satellite data and computer vision models. The researchers test the approach on precipitation-data-trained models. Transfer learning has also been used in human activity recognition (HAR) systems: An et al. (2023) use pre-trained models trained with offline HAR classifiers on new users. The researchers introduce representation analysis for transferring specific features from the offline users while maintaining good complexity analysis in the target setting. Sharma et al. (2023) attempt to recognise human behaviour from real-time video. The researchers classify the behaviour as suspicious or usual using data from the real-time video frames on the Novel 2D CNN, VGG16, and ResNet50 pre-trained models.
In a recent study by Mehta & Krichene (2023), transfer learning enhances deep learning model security. The security of private models is paramount to protecting deep learning models, which bad actors can plausibly attack to reveal information from the training examples (differential privacy). The researchers propose transfer learning as a promising technique for improving the accuracy of private models. The process involves training a model on a dataset with no privacy concerns and then privately fine-tuning it on a more sensitive dataset. The researchers simulate the adjustments on the ImageNet-1k, CIFAR-100, and CIFAR-10 datasets.

METHODOLOGY

Proposed study's approach
The proposed method comprises two parts: the selection of quality dataset samples and the dynamic selection of the pre-trained model's fine-tunable layers, as shown in Fig. 1.
As shown in Fig. 1, the target and source domain data samples are compared based on their textural features, resulting in a final target dataset as expressed in Eqs. (5) to (8). The pre-trained model layers are then selected for transfer learning to accomplish a target CNN task, as shown in Eqs. (9) to (13).

Selection of quality image dataset
The selection of quality images involves extracting textural features from a target domain image, conflating its textural features' probability distributions, and comparing the resultant distribution with the conflated distributions in the source domain images.
Definition 1. The conflation of a target image's textural features. Given a target image $T_{i1}$ and its textural features $T_{if_1}, \ldots, T_{if_n}$, the conflated probability distribution can be expressed in Eq. (5) as

$$\&(T_{if_1}, \ldots, T_{if_n})(x) = \frac{T_{if_1}(x) \cdots T_{if_n}(x)}{\sum_{y} T_{if_1}(y) \cdots T_{if_n}(y)} \tag{5}$$

where $T_{if}$ represents a target image's feature. We can proceed and equate the conflated value to $\&T_{if}$, as expressed in Eq. (6):

$$\&T_{if} = \&(T_{if_1}, \ldots, T_{if_n}) \tag{6}$$

Definition 2. The conflation of a source image's textural features. Given a source image $S_{i1}$, its textural feature vector distributions $S_{if_1}, \ldots, S_{if_n}$ can be conflated into $\&S_{if}$, as expressed in Eq. (7) below:

$$\&S_{if} = \&(S_{if_1}, \ldots, S_{if_n}) \tag{7}$$

where $S_{if_1}$ represents the probability distribution of the first feature in a selected source image.
Once the source conflated distributions are identified, a vector of the source image conflated probability distributions is used to check its divergence from the target image conflated distribution, as shown in Eq. (8):

$$D_{KL}\left(\&S_{if} \,\|\, \&T_{if}\right) = \sum_{x} \&S_{if}(x) \log \frac{\&S_{if}(x)}{\&T_{if}(x)} \tag{8}$$

where D_KL represents the Kullback-Leibler divergence. Finally, we select images whose D_KL is lower than the average D_KL of the source image distributions.
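This selection rule can be sketched in numpy, assuming the conflated distributions have already been computed for each image. The smoothing constant `eps` and the direction of the divergence (source against each target) are illustrative assumptions, not details taken from the paper:

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """D_KL(p || q) for discrete distributions, smoothed to avoid log(0)."""
    p, q = np.asarray(p, float) + eps, np.asarray(q, float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

def select_below_average(target_conflations, source_conflation):
    """Keep the indices of target images whose conflated distribution
    diverges less than average from the source conflated distribution."""
    dkl = [kl_divergence(source_conflation, t) for t in target_conflations]
    avg = np.mean(dkl)
    return [i for i, d in enumerate(dkl) if d < avg]
```

Targets close to the source distribution fall below the average divergence and survive; strongly diverging targets are discarded, which mirrors the below-average D_KL rule above.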

Selection of fine-tunable layers in pre-trained models
The fine-tunable layer selection involves identifying the convolutional layers in a pre-trained model and their positive and negative weights, and then selecting fine-tunable layers by utilising the weights' divergence.
Definition 3. Identification of a convolutional layer. Given a pre-trained model $M$, a convolutional layer $C_l$ is expressed as in Eq. (9):

$$C_l = \begin{cases} M_l, & \text{if } \text{"conv"} \in M_{ln} \\ \text{skipped}, & \text{otherwise} \end{cases} \tag{9}$$

where $M_l$ is the model's layer and $M_{ln}$ represents the model's layer name, which identifies a convolutional layer if it contains the keyword "conv"; otherwise, the layer is skipped.

Definition 4. Identification of positive and negative weights. Given a convolutional layer $C_l$, a weight filter $C_{lw}$ is a vector of $n \times n$ kernel size. Given two filters, $x$ and $y$, we can reshape the tensors from $C^{x}_{lw} \in \mathbb{R}^{n \times n}$ and $C^{y}_{lw} \in \mathbb{R}^{n \times n}$ into $C^{x'}_{lw}$ and $C^{y'}_{lw}$, respectively. We can express the positive and negative weight units as $C^{x'+ve}_{lw}$ and $C^{x'-ve}_{lw}$. Therefore, each weight filter becomes a single tensor of positive and negative weights, as expressed in Eq. (10):

$$C^{x'}_{lw} = \left[C^{x'+ve}_{lw_1}, \ldots, C^{x'+ve}_{lw_n}, C^{x'-ve}_{lw_1}, \ldots, C^{x'-ve}_{lw_n}\right] \tag{10}$$

where $C^{x'+ve}_{lw_1}$ represents the first positive weight unit of filter $x$ for the convolutional layer $C_l$.

Definition 5. Divergence measure between layers. Given two single-dimensional layers $C^{x'}_{lw}$ and $C^{y'}_{lw}$, we can calculate their differences by converting the vectors into probability distributions based on their positive or negative weight vectors. Utilising D_KL, the divergence of the positive weight vectors is expressed in Eq. (11):

$$D_{KL}\left(p(C^{x'+ve}_{lw}) \,\|\, p(C^{y'+ve}_{lw})\right) = \sum p(C^{x'+ve}_{lw}) \log \frac{p(C^{x'+ve}_{lw})}{p(C^{y'+ve}_{lw})} \tag{11}$$

where $p$ refers to a probability distribution. We can simplify this further by substituting $p(C^{x'+ve}_{lw})$ with $p(x)$ and $p(C^{y'+ve}_{lw})$ with $p(y)$, as shown in Eq. (12):

$$D_{KL}(p(x) \,\|\, p(y)) = \sum p(x) \log \frac{p(x)}{p(y)} \tag{12}$$
The layers with the lower divergence measures are then selected for use in the fine-tuning process.
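The layer-selection idea can be sketched as follows, assuming each layer is summarised by one kernel tensor. The histogram binning (`bins=10`) and the pairing of consecutive layers are illustrative assumptions; the study does not specify these details here.

```python
import numpy as np

def positive_weight_distribution(kernel, bins=10):
    """Flatten a convolutional kernel, keep its positive weights, and
    histogram them into a probability distribution (illustrative sketch;
    assumes the kernel contains at least one positive weight)."""
    w = np.asarray(kernel).ravel()
    pos = w[w > 0]
    hist, _ = np.histogram(pos, bins=bins, range=(0, pos.max()))
    return (hist + 1e-12) / (hist.sum() + bins * 1e-12)

def rank_layers_by_divergence(kernels):
    """Pairwise D_KL between consecutive layers' positive-weight
    distributions; lower-divergence pairs are fine-tuning candidates."""
    dists = [positive_weight_distribution(k) for k in kernels]
    def kl(p, q):
        return float(np.sum(p * np.log(p / q)))
    scores = [kl(dists[i], dists[i + 1]) for i in range(len(dists) - 1)]
    return np.argsort(scores)  # pair indices, lowest divergence first
```

Replacing `w > 0` with `w < 0` gives the negative-weight variant described later in the paper.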

Algorithm
The following steps summarise the proposed approach: i) Select an image $T_{i1}$ from a target dataset.
ii) Extract $T_{i1}$'s textural features using a pre-trained model $M$, giving feature vectors $T_{if_1}, \ldots, T_{if_n}$.
iii) Convert the feature vectors into probability distributions.
iv) Conflate the probability distributions into $\&T_{if}$.
v) Repeat the procedure for the source images.
vi) Calculate the average Kullback-Leibler divergence (D_KL) of the source images.
vii) Compare the D_KL of $T_{i1}, \ldots, T_{in}$ to the source average, selecting data points whose D_KL is lower. These images form the final target dataset.
viii) Identify the convolution layers in pre-trained model M using the keyword "conv".
ix) Identify the positive and negative weights in each convolutional layer $C_l$.
x) Create probability distributions of the positive and negative weight filter units.
xi) Calculate the D_KL between the positive or negative probability distributions of any two layers.
xii) Repeat the procedure in (xi) for all layer pairs.
xiii) Select the layers with the lowest D_KL as candidates for fine-tuning.
xiv) Perform transfer learning on the target task utilising the target dataset in step (vii) and the selected layers in step (xiii).

Divergence measures
Divergence is a measure of the difference between any two probability distributions. This proposed approach uses D_KL and compares it to four other divergence measures: Jensen-Shannon, Bhattacharyya, Hellinger and Wasserstein.
Kullback-Leibler: Theodoridis (2015) presents it as a measure between two probability distributions, as expressed in Eq. (12). The divergence gives a zero measure when the two distributions are equal. Adding the zero condition, we can rewrite Eq. (12) as Eq. (13):

$$D_{KL}(p(x) \,\|\, p(y)) \ge 0, \quad \text{with } D_{KL}(p(x) \,\|\, p(y)) = 0 \text{ if and only if } x = y \tag{13}$$

Jensen-Shannon: a symmetrised version of the D_KL that measures the distance between two probability distributions. Unlike the D_KL, it has high computational costs in search operations (Nielsen & Nock, 2021).
Hellinger distance: a measure of the difference between any two probability distributions in a shared space. It is also known as Jeffreys distance but has a higher computational complexity than the D_KL, as noted by Greegar & Manohar (2015).
Wasserstein distance: a measure of the difference between any two probability distributions. It uses the concept of the amount of earth moved and the distance involved. The divergence has been used in optimal transportation theory, as Villani (2003) noted.
Bhattacharyya: a measure between two probability distributions that uses the cosine of the angle between them to interpret their overlap. Like the other mentioned divergence measures, it is computationally complex despite its good performance.
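The five measures can be compared side by side on the same pair of discrete distributions. This numpy-only sketch uses the standard textbook formulas; the 1-D Wasserstein distance via cumulative sums assumes unit spacing between bins:

```python
import numpy as np

def divergences(p, q, eps=1e-12):
    """The five measures compared in this study, for discrete
    distributions on the same support (illustrative sketch)."""
    p = np.asarray(p, float) + eps; p /= p.sum()
    q = np.asarray(q, float) + eps; q /= q.sum()
    kl = np.sum(p * np.log(p / q))
    m = 0.5 * (p + q)  # Jensen-Shannon: symmetrised KL against the mixture
    js = 0.5 * np.sum(p * np.log(p / m)) + 0.5 * np.sum(q * np.log(q / m))
    hellinger = np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))
    bc = np.sum(np.sqrt(p * q))            # Bhattacharyya coefficient
    bhattacharyya = -np.log(bc)
    wasserstein = np.sum(np.abs(np.cumsum(p) - np.cumsum(q)))  # 1-D bins
    return {"KL": kl, "JS": js, "Hellinger": hellinger,
            "Bhattacharyya": bhattacharyya, "Wasserstein": wasserstein}
```

All five measures vanish when the two distributions coincide and grow as they separate, which is the property the layer and data-point comparisons above rely on.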
The Kullback-Leibler divergence has been used to model the physical microstructure properties of steel (Lee et al., 2020), to separate multi-source speech signals (Togami et al., 2020), and to extract features in the development of an impulse-noise-resistant LBP. Yuhong et al. (2019) have used D_KL to develop a scale filter bank in a CNN model to create a down-sampled spectrum from two distributions.

Data preparation
Since the approach utilises selected images from the larger datasets, the images are input into the pre-trained models with 224 × 224 pixel dimensions.The images are converted into grayscale, and their features are extracted using the first convolutional layer of the selected model.The pre-trained model is the feature extractor since it is also used in the final transfer learning process.This attribute makes it ideal for preparing the final transfer learning environment.Once the feature selection is made through the proposed approach, the dataset is categorised as a training or test dataset.
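A minimal sketch of this preparation step, using standard luminance weights for grayscale conversion and a nearest-neighbour resize. The study would typically rely on its framework's image utilities; the function names here are illustrative:

```python
import numpy as np

def to_grayscale(rgb):
    """Luminance-weighted grayscale conversion (ITU-R BT.601 weights)."""
    return rgb @ np.array([0.299, 0.587, 0.114])

def resize_nearest(img, size=(224, 224)):
    """Nearest-neighbour resize to the pre-trained models' input size."""
    h, w = img.shape[:2]
    rows = np.arange(size[0]) * h // size[0]
    cols = np.arange(size[1]) * w // size[1]
    return img[rows][:, cols]
```

The resized grayscale array would then be fed to the first convolutional layer of the chosen pre-trained model to extract the textural features.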

Experimental setup and settings
This study uses five pre-trained models: ResNet50, DenseNet169, InceptionV3, VGG16, and MobileNetV2 (used to show the proposed approach's performance on small networks). The inputs to the models are scaled to 224 × 224 pixels, with the InceptionV3 model using an upsampling layer and VGG16 taking 4,096 neurons in the last layer before classification. These pre-trained models have been trained on the ImageNet dataset. The pre-trained models and their parameters are listed in Table 2, with InceptionV3 having the most layers (Team, 2023). The study also uses a custom CNN model to show the proposed methods' effects on a non-pre-trained model.
The experiments have been conducted using the TensorFlow framework with the Keras library on the PaperSpace platform (A4000, 45 GB RAM, 8 CPUs with a 16 GB GPU).
The training of the models involved the selected datasets without data augmentation.However, during training, fine-tuned model regularisation was performed using Dropout and Batch normalisation.

Proposed approach methods
The proposed approach introduces four methods from the image textural features and pre-trained model layer selection views.

Textural features view in image data
This view utilises two methods: above-average D_KL and below-average D_KL.
Above-average D_KL: The method uses data points whose D_KL is higher than the average for all the data points in their category.
Below-average D_KL: The method uses data points whose D_KL is lower than the average of other samples in their class.

Layer selection view in pre-trained models
This view uses two methods: positive-weights D_KL and negative-weights D_KL.
Positive-weight D_KL: The method evaluates the divergence between two distributions formed from the positive weights of filters in a pre-trained model's convolutional layers.
Negative-weight D_KL: The method compares two distributions formed from the negative weight filters in the pre-trained model's convolutional layers.
In both views, D_KL is used and compared to four divergence measures to determine the reasons behind its selection.

Commonly used transfer learning methods
Since the introduced methods in the proposed approach aim to improve the transfer learning process, they are compared with these commonly used methods in transfer learning:
Standard fine-tuning: The method replaces the classification layer with a new classification layer for the target task's classes.
Last-k layer fine-tuning: It involves replacing the last k layers in the network with other layers suitable for the target task. These layers can be 1, 2 or 3 in the pre-trained model.
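Last-k fine-tuning amounts to freezing all but the final k layers before training. A framework-agnostic sketch of that freezing rule (the helper name and the dictionary return type are illustrative, not a Keras API):

```python
def freeze_for_last_k(layer_names, k):
    """Mark only the last k layers trainable, as in last-k fine-tuning.
    Returns a {layer_name: trainable} map (framework-agnostic sketch)."""
    n = len(layer_names)
    return {name: (i >= n - k) for i, name in enumerate(layer_names)}
```

In a Keras model the same effect is obtained by setting each layer's `trainable` attribute according to this map before compiling.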

RESULTS
This section presents the results of the four methods as follows: results on the textural features methods, the dynamic layer selection, and the commonly used methods and complexities. The results indicate performance measured using accuracy.

Results on conflated textural features methods
The accuracy performance of the two textural methods, GLCM and LBP, is shown in Tables 3-7 for the various datasets and models. GLCM's energy and homogeneity give the best accuracy performance compared to LBP and GLCM's correlation. LBP gives the lowest performance compared to the GLCM properties, while ResNet50 performs lowest on CIFAR-100 among the datasets.
In the VGG16 model, correlation and energy perform best among the GLCM properties. CIFAR-10 gives the highest performance across all the properties, while CIFAR-100 still gives the least accuracy.
GLCM's energy and LBP perform better than the other properties on the InceptionV3 model. GLCM's correlation and homogeneity give the least performance for InceptionV3, and the CIFAR-10 dataset gives the best accuracy performance across the four textural properties among the datasets. Figure 2A shows the conflated performance of CIFAR-10 samples on the VGG16 pre-trained model.
Figure 2 shows that samples below the average D_KL perform better, illustrating the importance of selecting quality data points in a dataset.
GLCM's energy and LBP give better results than the other properties on the DenseNet169 model. GLCM's correlation gives most of the lowest accuracy values, and the CIFAR-100 dataset still gives the least performance.
GLCM's energy and homogeneity give the best accuracies for the MobileNetV2 pre-trained model, while GLCM's correlation gives the least accuracy. GLCM's energy gives more than half of the best results of all four properties, followed by GLCM's homogeneity, LBP and correlation. Figure 3 shows the energy property performance of Fashion-MNIST and MNIST on MobileNetV2.
The dataset samples with below-average D_KL give high accuracy on both MNIST and Fashion-MNIST, showing the divergence between the data points and its effect in adapting the pre-trained model to the target task. The relevance of these target data points to the source samples is essential to the adaptation.

Results on dynamically selected layer methods
The dynamic layer selection methods form the second part of the model.The results of the selected layers for the pre-trained model are presented in Tables 8-12.
Before using the selected conflated D KL , MNIST performs best when utilising selected dynamic layers, with the CIFAR-100 dataset giving the least performance.However, using conflated D KL samples of the CIFAR-100 dataset and the positive D KL dynamically selected layers improves the ResNet50 pre-trained model's performance, as seen in Table 8.The negative D KL dynamically selected layers also result in minimal improvement due to the conflated dataset samples.
Using dynamically selected layers performs well compared to the standard methods (see "Results of proposed methods against commonly used transfer learning methods"). However, a further improvement is noted when employing samples with below-average D_KL, as seen in Table 9. The CIFAR-100 dataset gives the least accuracy, while the MNIST dataset performs best. Figure 4 shows the performance of the ISIC 2016 dataset on the VGG16 pre-trained model, with and without conflated selected images and dynamic layers.
The dataset samples with below-average D_KL perform better on the VGG16 pre-trained model than those with above-average D_KL. The performance improves as the dynamically selected layers fine-tune the pre-trained model. In Fig. 5, the selected layer (layer 1) shows more excitatory (positive) weights in its first channel than layer 9. A similar trend of improvement in the fine-tuned, dynamically selected layers' model is noted in the InceptionV3 pre-trained model, as seen in Table 10. The MNIST datasets still perform well, probably owing to their number of data points and classes compared to the poorly performing CIFAR-100 dataset. The two smaller datasets, ChestX-ray8 and ISIC 2016, also perform well. The trend of low performance continues with the CIFAR-100 dataset on the DenseNet169 pre-trained model, as seen in Table 11. The MNIST datasets continue to perform best when utilising below-average selected samples, improving the adaptation performance in the target task.
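The dynamic layer-selection idea can be sketched as follows, assuming the pre-trained model's layer weights are available as arrays. The layer names, weight values, and the "over half positive" threshold are illustrative assumptions, not the paper's exact criterion:

```python
import numpy as np

# hypothetical flattened weight tensors of a pre-trained model's layers
layer_weights = {
    "conv_1": np.array([0.6, 0.4, -0.1, 0.3]),    # mostly excitatory (positive)
    "conv_2": np.array([-0.5, -0.3, 0.1, -0.2]),  # mostly inhibitory (negative)
    "conv_3": np.array([0.2, 0.5, 0.1, -0.4]),    # mostly excitatory
}

def mostly_positive(w):
    """A layer qualifies for fine-tuning when over half its weights are positive."""
    return (w > 0).mean() > 0.5

# layers that would be unfrozen (trainable) before fine-tuning;
# the remaining layers keep their pre-trained weights frozen
selected = sorted(name for name, w in layer_weights.items() if mostly_positive(w))
print(selected)  # → ['conv_1', 'conv_3']
```

In a Keras or PyTorch pipeline, the selected names would map to setting `trainable = True` (or `requires_grad = True`) on those layers only, replacing the trial-and-error choice of which layers to unfreeze.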
The MobileNetV2 pre-trained model gives similar results to the other four pre-trained models, as shown in Table 12. The MNIST dataset gives the best accuracy among the datasets, and a performance improvement is noted when data points with below-average D_KL are used together with the positive-D_KL dynamically selected layers in the transfer learning process. The strong performance of the positive D_KL when using the selected samples is reflected in the precision values shown in Table 13. Precision indicates the repeatability of the models' good performance, i.e., how well the model predicts the correct class. Table 14 examines the recall performance of using the positive samples to determine how well the model identifies the positives. Figure 6 shows a confusion matrix of the ISIC 2016 testing dataset on MobileNetV2, where the numbers of false negatives and false positives are low in classifying benign or malignant conditions.
In Table 14, the recall values are slightly lower than the accuracy and precision values. Nevertheless, the model identifies a good proportion of the positives during classification.

Results of proposed methods against commonly used transfer learning methods
The introduced methods of the proposed approach have been compared to the three commonly used last-k methods: k-1, k-2 and k-3. Table 15 shows that the introduced methods outperform these standard methods.
Combining the positive-D_KL dynamically selected layers with data points below the conflated average D_KL gives an average improvement of 0.87% across the five pre-trained models on the ChestX-ray8 dataset, with the DenseNet169 model giving the best improvement of 1.57% and the VGG16 model the smallest improvement of 0.06% compared to standard fine-tuning, as seen in Table 15. Among the last-k methods, the combination improves most markedly on k-3. A similar performance is shown in Fig. 7: combining data points below the average D_KL with dynamically selected layers gives better accuracy than the commonly used methods and the individually introduced methods. Utilising above-average D_KL samples without dynamically selected layers gives the least performance, while using below-average samples alone is the second-best method. Figure 8 shows an ISIC 2016 dataset sample with and without transfer learning on the VGG16 pre-trained model. When the same samples are used in transfer learning, below-average D_KL performs better and takes less time than a typical convolutional neural network (the 12-layer CNN listed in Table 2), which takes 50 s longer to train on the ISIC 2016 samples.
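For reference, the last-k baselines (k-1, k-2, k-3) against which the proposed method is compared simply unfreeze the final k layers of the pre-trained model. A minimal sketch with made-up layer names:

```python
def last_k_trainable(layer_names, k):
    """Return a trainable flag per layer: only the last k layers are fine-tuned."""
    n = len(layer_names)
    return {name: i >= n - k for i, name in enumerate(layer_names)}

layers = [f"block_{i}" for i in range(1, 11)]  # a hypothetical 10-layer backbone
flags = last_k_trainable(layers, 3)            # the k-3 baseline
trainable = [name for name, on in flags.items() if on]
print(trainable)  # → ['block_8', 'block_9', 'block_10']
```

Unlike the dynamic selection above, the choice of k here is fixed in advance regardless of the weights the layers actually hold, which is the trial-and-error cost the proposed approach aims to remove.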

Results on computational complexities in proposed methods
The complexity of the proposed approach is evaluated against four divergence measures: Wasserstein, Hellinger, Jensen-Shannon and Bhattacharyya. The results in Table 16 show that D_KL has a good balance of memory and time complexities. Wasserstein and Jensen-Shannon have better time complexities but consume more computing resources than D_KL. Tables 16 and 17 show that it takes about 70.37 milliseconds (ms) to select a sample and pass it through a selected layer in ResNet50. The complexities of D_KL are better than those of Hellinger and Bhattacharyya, which take 99.63 and 239.14 ms, respectively. According to Rubner, Tomasi & Guibas (2000), Wasserstein is more complex than Bhattacharyya, forming a reasonable basis for selecting D_KL in this study. Further evaluation of the divergence measures is shown in Fig. 9.
The models' accuracy when using D_KL-conflated dataset samples is closest to that of Hellinger's conflated dataset samples, as seen in Fig. 9, outperforming the other divergences. However, as reported in Tables 16 and 17, Hellinger has a higher computational complexity.
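The divergence measures compared above can be written over discrete (histogram) distributions as follows. This is a numpy-only sketch with illustrative inputs; Wasserstein is omitted because it additionally needs the geometry of the supports (e.g. scipy.stats.wasserstein_distance):

```python
import numpy as np

def _norm(p, eps=1e-12):
    """Smooth and renormalise a histogram into a probability distribution."""
    p = np.asarray(p, dtype=float) + eps
    return p / p.sum()

def kl(p, q):                 # Kullback-Leibler divergence D_KL(p || q)
    p, q = _norm(p), _norm(q)
    return float((p * np.log(p / q)).sum())

def jensen_shannon(p, q):     # symmetrised, bounded variant built from KL
    p, q = _norm(p), _norm(q)
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def hellinger(p, q):          # bounded in [0, 1]
    p, q = _norm(p), _norm(q)
    return float(np.sqrt(0.5 * ((np.sqrt(p) - np.sqrt(q)) ** 2).sum()))

def bhattacharyya(p, q):      # -log of the Bhattacharyya coefficient
    p, q = _norm(p), _norm(q)
    return float(-np.log(np.sqrt(p * q).sum()))

p, q = [0.1, 0.4, 0.5], [0.2, 0.3, 0.5]  # two toy histograms
```

All four reduce to simple element-wise passes over the histograms, which is consistent with the memory/time trade-offs reported in Tables 16 and 17 being dominated by the number of distribution evaluations rather than any one formula.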

DISCUSSION
From the results, the performance of the pre-trained models is improved at two levels: selecting the relevant data points and utilising dynamically selected layers. In selecting the relevant data points, GLCM's energy and homogeneity properties perform well compared to the other properties, GLCM's correlation and LBP. Their contribution to good performance can be attributed to the strong neighbouring of pixels with similar grey levels. High GLCM homogeneity corresponds to a low density of edges among the pixels. Chaves (2021) and Mathworks (2023) note that pixels along the diagonal change smoothly relative to those distant from the main diagonal. The samples with below-average D_KL contain values closer to each other, with many adjacent pixels having similar values; pixels of lower values lie far from the diagonal.
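For reference, the two GLCM properties highlighted above are the standard Haralick-style definitions over the normalised co-occurrence matrix p(i, j):

```latex
\text{Energy} = \sum_{i}\sum_{j} p(i,j)^{2}, \qquad
\text{Homogeneity} = \sum_{i}\sum_{j} \frac{p(i,j)}{1 + |i-j|}
```

Energy is large when a few (i, j) pairings dominate (uniform texture), and homogeneity is large when most mass sits near the diagonal i = j, i.e. adjacent pixels share similar grey levels, matching the discussion above.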
The GLCM properties outperform the LBP textural descriptor in many instances. This can be attributed to their better uniformity and simplicity, indicating that GLCM properties can aid in improving CNNs' performance, especially in cases of inadequate data, a prime use case for transfer learning. In using below-average D_KL samples, the selected data points exploit their lower informational difference from the source samples when adapting to the target task. At the dynamic layer-selection level, layers whose weights are more positively signed give higher performance than those with negative weights. The positive weights in a layer are considered excitatory, as Najafi et al. (2020) note, with the ability to select stimulating features during the model's training. As noted in Fig. 5, the weights of the first channel of the first layer start the convolutional process with a higher magnitude (the RGB values of colours closer to white are near 255) towards the divergence than the counterpart filter in the first channel of layer 9. This property leads to better convergence and faster training, as Delaurentis & Dickey (1994) note. With negative weights, the model cannot descend well, hence the preference for positive weights, as noted by Shamsuddin, Ibrahim & Ramadhena (2013). Weights are essential to the model's training, where convergence happens despite the other parameters, since a change in weights affects reaching the global minima. A change in weight sign alters the magnitude and, consequently, the direction of the descent.
Sidani offers a more intuitive account of the effect of positive weights on training, noting that an increase in the weight of the previous and current derivatives stabilises the network, leading to a quicker descent. Therefore, positive weights in the training process can correct back-propagation errors, leading to a better path to the global minima. The negative-weights D_KL method is also noted to improve with the use of better samples. This improvement by the negative (inhibitory) weights arises because they still stabilise the training process, especially in cases of exploding gradients. This slows learning but ensures the model can capture the features well enough. However, positive weights have the upper hand in directing the model to the global minima.
It is also noted that heavy pre-trained models with many layers, such as DenseNet169 (Table 2), and datasets with many classes, such as CIFAR-100, give lower performance. This could result from parameter complexity requiring more training, a behaviour also reported by Vrbančič & Podgorelec (2020) in the DEFT method. The heavy DenseNet169 model also incurs higher complexity, as Tables 16 and 17 show.
The selected D_KL method has low-to-medium time complexity compared to the other divergences, despite the lower time complexities of Jensen-Shannon and Wasserstein and Hellinger's better training-loss curve relative to D_KL. However, comparing the memory complexity of D_KL with the combined complexities of the other divergences, as seen in Tables 16 and 17, the latter can unfold into extensive complexity, which guided the selection of D_KL for this study.

CONCLUSION
This article introduces the conflation of textural features for selecting quality data points in a target-domain dataset and the use of weights for dynamically selecting fine-tuning layers, improving the transfer learning process. The enhanced model demonstrates that using the right data points and suitable layers can improve a pre-trained model's performance relative to the commonly used transfer learning methods. The model has been evaluated on five pre-trained models and nine datasets. The results show how transfer learning adaptation is affected by information divergence at both the data and layer levels. However, the approach has a higher time complexity than the commonly used methods due to the extra step of dynamic layer selection. The approach offers a better method in cases of inadequate data, reducing trial and error in selecting the right data points and layers for fine-tuning.
Future work can extend to architectures other than CNNs to further investigate the importance of divergence in data samples and the models' layers.
Chouhan et al. (2020) utilise five pre-trained models to classify pneumonia images. Niu et al. (2021) classify COVID-19 lung CT images using the distant domain transfer learning (DDTL) model on three source domain datasets (unlabelled Office-31, Caltech-256, and chest X-ray) and one target domain dataset (labelled COVID-19 lung CT); their study aims to reduce the distribution shift between the domain data. Zoetmulder et al. (2022) use CNN pre-trained models on three T1 brain segmentation tasks (MS lesions, brain anatomy, and stroke lesions) using natural images and T1 brain MRI images. Raza et al. (2023) use transfer learning to classify and segment Alzheimer's disease on the brain's grey matter images. Holderrieth, Smith & Peng (

Figure 4 A plot of the Melanoma dataset on VGG16 with conflated dataset and dynamically selected layers. DOI: 10.7717/peerj-cs.1601/fig-4

Table 3
VGG16 textural features accuracy performance.

Figure 5 The selected layer (layer 1) with lower D_KL is seen to have more excitatory weights (light ones) in the first channel than layer 9 of the VGG16 model. Its selection coincides with the first layers being able to extract features better.

Table 7
ResNet50 accuracy performance using the selected DKL methods.

Table 8
VGG16 accuracy performance using the selected DKL methods.

Table 9
InceptionV3 accuracy performance using the selected DKL methods.

Table 10
DenseNet169 accuracy performance using the selected DKL methods.

Table 11
MobileNetV2 accuracy performance using the selected DKL methods.

Table 13
MobileNetV2 precision performance using the selected DKL methods.

Table 14
MobileNetV2 recall performance using the selected DKL methods.

Table 16
Computational complexity of divergence measure on Caltech256-sample selection.

Table 17
Complexity comparison to other divergence measures-layer selection.