Transfer learning data adaptation using conflation of low‐level textural features

Adapting the target dataset for a pre-trained model is still challenging. These adaptation problems result from a lack of adequate transfer of traits from the source dataset; this often leads to poor model performance and to trial and error in selecting the best-performing pre-trained model. This paper introduces the conflation of source domain low-level textural features extracted using the first layer of the pre-trained model. The extracted features are compared to the conflated low-level features of the target dataset to select a higher-quality target dataset for improved pre-trained model performance and adaptation. From a comparison of various probability distance metrics, Kullback-Leibler divergence is adopted to compare the samples from both domains. We experiment on three publicly available datasets and two ImageNet pre-trained models used in past studies for results comparison. The proposed approach yields two categories of target samples, with those with lower Kullback-Leibler values giving better accuracy, precision and recall. The samples with the lower Kullback-Leibler values improve accuracy by margins of 0.22%–9.15%, leading to better model adaptation and an easier model selection process for the target transfer learning datasets and tasks.

descriptors of the image, while the high-level looks at the interpretation of the image data. 7 For successful image analysis, feature extraction takes precedence, 8,9 and it has many applications.
Identification of low-level features in an image forms the basis of image detection and classification. Image analysis involves the acquisition of quantitative information from the pixel values. Low-level features include colour, shape and texture. 10 This work looks into the low-level textural features, which are the geometric arrangement of the grey levels of an image. 11 Texture is a function of the spatial variation of the brightness intensity of an image's pixels. 12 Texture is a vital component of human visual perception and is used in many computer vision systems. 13 According to Liu and Kuang, the texture image affects the accuracy of subsequent classification. 14 Furthermore, textural images give specific distribution patterns and repeated sequential dispersion of pixels' illumination in the image. A textural analysis is an essential measure of the spatial arrangement of the grey tones within the neighbourhood of a pixel. 15,16 Texture analysis is used in CBIR, image classification, and medical image processing, among others, as noted by Singh and Srivastava. 17 Textural features differentiate the roughness, smoothness and other features in an image. 18 In image feature extraction, a CNN is an excellent textural feature extractor. 19 According to Fekri-Ershad, 8 textural methods can be divided into four main categories: statistical, structural, model-based, and transform. Humeau 13 outlines that feature extraction methods are categorised into seven classes: statistical, structural, transform-based, model-based, graph-based, learning-based, and entropy-based approaches.
In the standard image processing phases, extracting textural features would involve using low-pass filters (such as Gaussian blur) to denoise the image, image transformations and normalisation, and extraction of the features using Gabor filters. The extracted features would then be quantified using descriptors through first-order statistics (such as mean, variance, entropy and standard deviation) as well as second-order statistics (such as correlation and sum entropy). The measures would then be subject to some relevance indices (such as information gain) for ranking the features for effective selection. Classification of the image utilising the selected features would use a traditional machine learning classifier such as Support Vector Machines (SVM), Radial basis functions (RBF), Random Forest, Naïve Bayes, K-Nearest Neighbour (KNN), Random Tree, or a convolutional neural network (CNN).
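To make this classical pipeline concrete, the sketch below (a minimal illustration, not the paper's implementation) denoises a grayscale image, quantifies its texture with second-order GLCM statistics and trains an SVM; the scikit-image and scikit-learn calls are standard, but the `images` and `labels` variables are placeholder assumptions.

```python
# Minimal sketch of the classical texture pipeline: denoise -> GLCM statistics -> SVM.
import numpy as np
from skimage.filters import gaussian
from skimage.feature import graycomatrix, graycoprops
from sklearn.svm import SVC

def texture_descriptor(gray_img):
    """gray_img: 2-D array in [0, 1]. Returns second-order GLCM texture statistics."""
    smooth = gaussian(gray_img, sigma=1.0)               # low-pass denoising
    img8 = (smooth * 255).astype(np.uint8)               # quantise to 256 grey levels
    glcm = graycomatrix(img8, distances=[1], angles=[0, np.pi / 2],
                        levels=256, symmetric=True, normed=True)
    props = ("contrast", "correlation", "energy", "homogeneity")
    return np.hstack([graycoprops(glcm, p).ravel() for p in props])

# Hypothetical data: `images` is a list of grayscale arrays, `labels` their classes.
# X = np.vstack([texture_descriptor(im) for im in images])
# clf = SVC(kernel="rbf").fit(X, labels)
```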
With the development of the CNN models, the reuse of knowledge in these models became possible, making it easier for researchers to work with data from various domains. Fine-tuning is a transfer-learning strategy that enables models to work with smaller datasets, as noted by Menegola et al. 20 when they used a small data set of around 1000 images of skin lesions. Their experiment noted that working from scratch with small datasets could not draw general conclusions. Since the explosion of CNNs, several pre-trained models have been developed. Szegedy et al. 21 introduced a 22-layer network named GoogLeNet (Inception), which uses the inception blocks. He et al. 22 introduced using ResNet blocks (ResNet architecture), which learn the residual, training even deeper models than the regular function learning. Other famous models include InceptionV3, 23 MobileNets and the VGG family.
Transfer learning involves the reuse of knowledge across task domains. 24 Formally, transfer learning consists of a source domain D_s, a target domain D_t, a source task T_s, and a target task T_t; a predictive function f_t(·) is learnt in D_t by utilising the knowledge in D_s and T_s, where D_t ≠ D_s or T_s ≠ T_t.
Transfer learning can be used in many applications, including electronic health records and sentiment analysis. A target model leverages information learnt in a source domain, optimising the results of the target task. 25 Transfer learning is most effective when the source and the target have some form of relation, preventing negative transfer. Transfer learning is categorised into three learning paradigms: (1) inductive, where the tasks are different, but the domains are related or different, with some labelled target data; (2) transductive, where the tasks are similar, but the domains are different, with no labelled target data; and (3) unsupervised transfer learning, where tasks and domains are different. Transfer learning can be described from two perspectives based on feature and label space: (1) homogeneous transfer learning, where both domains have similar feature spaces, and (2) heterogeneous transfer learning, where the domains have different feature and label spaces.
The categorisation can be viewed from solution-based approaches: (1) instance transfer learning involves re-weighting instances in the source domain for direct use with the target data due to reduced marginal distribution. (2) Feature representation transfer learning involves creating a new feature space from both domains that can be symmetric or asymmetric. (3) Parameter transfer learning involves sharing parameters between the source and target models, and (4) relational-knowledge transfer learning, where a relationship is defined between the two domains. This categorisation has further been discussed by Zhuang et al. 24 and Zhang and Gao. 26 Other categorisation methods are based on deep transfer learning: instance-based, mapping-based, network-based and adversarial-based learning, which has recently gained much attention. Furthermore, the plurality of transfer learning has been achieved by combining data, data properties, and model approaches.
In a study by Liang, Fu and Yi, 27 four transfer learning methods have shown excellent transfer learning potential in recent years: (1) transitive transfer learning involves using an intermediary dataset between the source and target sets in the context of weak similarity. (2) Lifelong transfer learning involves intelligent agents that can adapt to various learning techniques to adapt and learn new knowledge in new environments. (3) Transfer reinforcement learning involves a reinforcement agent that can transfer knowledge and behaviour across many tasks, and (4) adversarial transfer learning involves using a generative adversarial network with two models, a generative and a discriminative one. The discriminative model is trained to estimate the probability that a sample comes from the training data rather than from the generative model, while the generative model is trained to fool the discriminative model as much as possible and thereby capture the data distribution. GANs are used in transfer learning cases such as image-to-image translation 28 focusing on input data adaptation as described by Zhang and Gao. 26 In a survey by Zhuang et al., 24 transfer learning can further be categorised based on instances (semi-supervised learning), features (multiview learning) and tasks (multitask learning). In a study by Kouw and Loog, 29 transfer learning can also be evaluated from data adaptation: (1) sample-based methods that correct bias in the data sampling procedure, (2) feature-based methods that reshape the data feature space, and (3) inference-based approaches that incorporate data adaptation into the parameter estimation procedure. Among the transfer learning methods, instance re-weighting, 30 deep adaptation, 31,32 classifier adaptation, 33 universal representation 34 and adversarial adaptation 35 lead to recently developed transfer learning methods.
This work utilises statistical approaches of textural analysis to identify samples with better adaptation in transfer learning. Statistical textural analysis methods include the Co-occurrence matrix (second-level histogram), 3 the histogram features and the Local binary patterns that look at the local spatial structures. 36 Other textural analysis algorithms exist, such as the grey scale level co-occurrence matrix, 11 binary Gabor pattern, 37 the local spiking pattern, 38 local binary grey level co-occurrence matrix (LBGLCM) 39 and the Grey level run-length matrix (primitive length texture features). 40 Meenakshi and Gaurav 41 have reviewed other methods thoroughly.

Problem in domain data adaptation
Transfer learning involves the reuse of a source domain knowledge in a target domain. However, as noted by Amir et al., 42 domain data adaptation is a significant concern, especially in cases where the target data samples are few. For domain data adaptation, it is essential to identify samples in the target that closely match (or are of good quality) the source domain for effective knowledge transfer (better performance). This identification can be made by comparing the target and the source domains' low-level dataset characteristics (textural). Once the quality target images are identified, the samples can be used as is or increased using data augmentation methods outlined by Shorten and Khoshgoftaar, 43 generating new data instances.

Contribution
In addressing target data adaption in a pre-trained model, this paper: • Proposes the conflation of extracted CNN features to compare similar features in the source and target task datasets. Combining the various feature distributions gives a better overall image features distribution by removing feature redundancies to consolidate the textural information from selected features, giving an overall textural measure of the image. 44 By identifying the textural measures of the images in the source and target datasets, the features can be easily compared from the image perspective representing its textural features.
• Addresses the selection of quality target dataset that improves the pre-trained model's performance. The selected images can be increased by augmentation, ensuring a more reliable model. Using a quality dataset is necessary to apply machine learning algorithms effectively. The use of machine learning in medical diagnosis, 45,46 such as medical imaging with low-quality images or signals, could lead to the wrong diagnosis. In transfer learning, quality can be described as transferring knowledge in related tasks for positive transfer. Comparing textural features between the images in the source and target domains and identifying closely related images gives a quality measure for improved transfer. Therefore, the proposed approach gives an alternative method for improving the selection of target samples that can yield a higher chance for positive transfer. These samples can be increased using methods such as data augmentation.
• Increases the chance of selecting better pre-trained models; the features extracted by the pre-trained model are used in transfer learning, making the process much more reliable and reducing the current trial and error in selecting pre-trained models. The selection of pre-trained classifiers is becoming a problem in machine learning. 47 Despite lots of unlabelled data, many practitioners feed their data through pre-trained models hoping to pick one that works. However, this trial and error can be drastically reduced by the proposed approach, where several samples of the source domain are used in selecting unlabelled samples in the target that work for the given selected pre-trained model. This approach significantly improves the reliability of the settled-on pre-trained models.
The remainder of this article is organised as follows: Section 2 examines the various approaches to selecting adaptive target datasets in transfer learning. Section 3 discusses the proposed approach. Section 4 discusses the datasets and experimental environment, while Section 5 discusses the results. Finally, section 6 gives a conclusion and future exploration.

LITERATURE REVIEW
This section looks at the various data adaptation studies, especially the data-based interpretation using the feature transformation strategy. 24 This study looks into homogeneous transfer learning to reduce the distribution difference between the source and the target domain data instances through textural low-level feature mapping.

Label efficient learning of transferable representations across domains and tasks
In this study by Luo et al., 48 two deep CNNs are used to compare target and source domain images utilising a generative adversarial network (GAN). The features of the domains are then compared using Kullback-Leibler divergence (DKL). The approach uses a softmax with temperature to calculate the semantic loss and controls the number of samples in the source domain. The temperature variation also allows the target points to be similar to multiple source classes when high and to one class when small. In this approach, the researchers used one architecture (ResNet); therefore, the applicability cannot be extended to other models unless extensive experiments are conducted. Again, some architectures may not apply since the policies are based on residual blocks. The proposed approach seeks to identify a general method that uses convolutional neural networks. However, the researchers' method performs relatively well in target domains with few samples. It is extendable to video action recognition, which gains significant per-video performance despite little change in per-frame prediction.

Deep features for training support vector machines
In this work, the researchers aim to explore the classification performance improvement of quality textural features extracted from the middle and last layers of pre-trained models. 49 They use shallower and deeper layers, reducing the features' dimensionality using discrete cosine transform (DCT) and principal component analysis (PCA). However, PCA does not perform well, resulting in the use of DCT. The features are analysed using LBP and ranked using Chi-square feature selection, applying DCT on each image channel of the DenseNet201, ResNet50, and GoogleNet pre-trained models. It uses a support vector machine (SVM) as a classifier due to its reduced training time and good results. For ResNet50, features are selected from the middle layers, with one layer selected in every ten layers plus the last four layers. The selection of the layers is based on sequential forward floating selection. This approach uses the middle and last layers, unlike the proposed method that utilises the first layer in textural feature extraction. Their approach does not report on the performance of the bottom layers, which the literature reports as giving the most distinctive features. It also fails to give the PCA results, only mentioning that it could have performed better in dimensionality reduction. However, DCT is applied in dimensionality reduction. This is different in the proposed approach, where dimensionality reduction is made by converting the feature vectors into one-dimensional vectors, as described in section 3.2. The researchers' method is tested thoroughly on thirteen virus datasets and three pre-trained models, giving consistent performance.

Selective fine-tuning (borrowing treasures from the wealthy)
This study addresses the data insufficiency in the target domain by using images with similar low-level characteristics in the source domain. 50 This process utilises two Gabor filters that return histograms of filter bank responses and kernels in convolutional layers of AlexNet, GoogleNet and VGG-19 pre-trained ImageNet models. The approach uses the nearest neighbour approach to match the features. The approach utilises Stanford Dogs 120, Oxford Flowers 102, Caltech 256, and MIT Indoor 67 as the datasets. A source domain with sufficient data is used simultaneously with a target learning task to identify the subset of closely related images in the target domain. The researchers differ from the proposed method in their use of additional Gabor filters, which the literature has shown to behave similarly to the first convolutional layers. 51 Searching for images using the K-Nearest neighbour approach is also computationally expensive compared to the simple arithmetic mean used in the proposed approach. However, the researchers' approach captures sharable layers between the source and target datasets, further improving the accuracy of the transfer learning process.

Using filter banks in convolutional neural networks for texture classification
This study proposes a simple architecture (Texture CNN) that explores texture features utilising filter banks in a CNN. 51 They use orderless texture descriptors from the AlexNet CNN layer feature outputs. Furthermore, they use two CNN layers and an energy-pooling layer that is finally connected to the softmax function for classification. The datasets used in this approach include ImageNet subsets (ImageNet-T, ImageNet-S1 and ImageNet-S2 with 28 classes). This approach is tested only on the AlexNet pre-trained model (alongside a from-scratch model); the behaviour could be tested on other pre-trained models to establish generality. The proposed approach utilises the first convolutional layer to extract the edge-like features widely used in textural analysis, unlike this study, which utilises the intermediate layers that allow weight sharing, adding energy layers to improve the descriptors. However, by using the energy layer, shape information is discarded due to its little significance in improving the model performance, giving a simple and efficient integration.

Deep convolutional neural networks and maximum-likelihood principle in approximate nearest neighbour search
This study uses dissimilarity measures utilising the nearest neighbour rules and the probability dissimilarity measures. 52 It uses the first-found image reference criterion, where a threshold is estimated as the β-quantile of the sequence. A Gaussian kernel convolves an input image to reduce image noise. The segmented image boundary is binarily eroded to exclude artefacts attributed to the Gaussian smooth. In standardising imaging space, intensity values are normalised across all the voxels. The approach uses the Gaussian mixture model (GMM) in extracting features (intensity) and uses three statistical features: mean, standard deviation and variance. The performed experiments use ResNet50 and VGG13 architectures, and the extracted features are classified using a random forest classifier. In this study, the researcher used the last layers of the CNN, unlike the proposed approach that utilises the first layer of the pre-trained models. It is more computationally expensive than the proposed approach due to its recursive algorithmic steps. However, both approaches utilise distance probability distributions.

Making sense of spatio-temporal preserving representations for EEG-based human intention recognition

The researchers propose transforming the 1-D vector into a 2-D mesh Electroencephalography (EEG) signal hierarchy, which has shown better results with "missing readings" than the chain-like 1-D EEG vector that is limited in connecting brain activity to the brain's structural regions. 53 The approach uses a deep neural network technique with two recurrent neural networks (RNNs) for the space and time dimensions. Spatial features are extracted from each data mesh, and an RNN extracts the temporal features, with zero padding in each convolutional layer to prevent information loss. This improved deep neural network technique addresses the complex pre-processing phases of typical deep neural network techniques, which often neglect the spatial and temporal information. The EEG signals stream to segments of 2-D meshes, classifying each segment as one of k categories. The experimentation uses the PhysioNet EEG dataset and an actual case study with 108 subjects. The researchers' approach caters for temporal features using RNNs, which can be further studied in this work. However, the researchers did not test their approach on mobile-friendly models such as MobileNetV2, used in the proposed approach; they only note that the framework's many parameters may not suit mobile devices.

An adaptive semi-supervised feature analysis for video semantic recognition
The researchers propose an optimal similarity graph in joint feature selection using semi-supervised learning in video semantic recognition. 54 This method addresses the curse of dimensionality in selecting low-level visual features, minding that practical applications of video semantic recognition use a vast majority of videos that do not have labels, making annotation expensive and time-consuming. Semi-supervised learning aims to learn the discriminative subset of the original features from the labelled and unlabelled data. In their proposed approach, they assume that close labels have high similarity and vice-versa, which makes it easier to make joint feature selections exploring the local structure, thereby learning the optimal similarity graph simultaneously. Given an initial similarity matrix, its neighbour assignment is updated based on the label difference of each video pair. The optimal similarity graph can address the curse of dimensionality and also captures the intrinsic local structures well. The experiments are conducted on five datasets, with the optimisation algorithm converging in no more than twenty iterations to the local minimum. This approach uses a similarity matrix in joint feature selection, a concept that can be explored in capturing the intrinsic local features between the source and target datasets. It differs from the proposed approach, which uses DKL in finding low-level feature similarity, while the researchers' approach uses the similarity matrix in joint feature selection.

A semi-supervised recurrent convolutional attention model for human activity recognition
In efforts to address the shortage of labelled activity data in human activity recognition (HAR) using semi-supervised learning methods, a pattern-balanced recurrent neural network attention model is proposed. 55 The model aims to extract salient information that gives the actual activity, obtaining valuable information from limited training data while addressing inter-person variability and inter-class similarity: a longstanding challenge in HAR. The proposed semi-supervised deep model acquires data from multimodal sensory data, extracting and preserving latent activity patterns from imbalanced datasets, with the added attention mechanism striving to balance the less-labelled data. This is done by utilising salient features while ignoring the irrelevant signals, and involves training the model using reinforcement learning to extract the salient patches. The approach also uses a Partially Observable Markov Decision Process (POMDP) in training and optimisation, utilising three datasets and two baselines. This approach differs from the proposed approach in using attention to address imbalanced datasets through salient features. The proposed method addresses data adaptability through low-level textural similarity in the target domain.

Conflation literature
In a study by Mitra, Saha and Hasanuzzaman, 56 they approximate a unified probability distribution in embedded learning for nearest neighbour search and dimensionality reduction in large datasets. This conflation aims to generate a unified embedding in low-dimensional space that preserves the neighbourhood identity of the datasets in multiple views. Rahman et al. 57 use the conflation concept in detecting robotic activity in a Recurrent Neural Network (RNN)-based sequential model. The researchers integrate the domain knowledge into the RNN-based sequential prediction using a Markov Logic Network (MLN) classifier that automatically learns the data constraint weights. The MLN proposes using two methods: the last layer, where the values of the hidden RNN layer are combined with the weights from the logic constraints, and the conflation of the class probabilities learnt from the RNN predictions and the constraint weights. From this study, the conflation of the constraint weights and the class probabilities improve the Long Short-Term Memory (LSTM) accuracy and show better regularisation capability on unseen data.
In an attempt to provide a more comprehensive profile of an entity, Gan et al. 58 use the conflation concept to create a character-level deep conflation model that can understand the semantic meaning of text strings and match them at the semantic level. The model encodes the input text strings from the characters, which are used as finite dimension feature vectors. The matching between the strings uses cosine similarity. It uses the LSTM-RNN and Convolutional Neural Network for developing a better entity profile with two or more tables of entities database in business data analytics.
The conflation of features has also been widely used in Geographic Information Systems (GIS). 59-61 Chen and Knoblock 62 define the process as the compilation or reconciliation of any two different geospatial datasets of an overlapping region. Ashok, Sharad and Kevin (2011) use a graph-theoretic approach in conflating disparate data sources, matching the features from multiple sources. 63 Sagi and Yerahmiel 64 use conflation in the registration of GIS and photogrammetric data using local transformations of the linear features from datasets. This method improves GIS accuracy and better utilisation of standard feature extraction techniques.

Other textural analysis and classification literature
In a study by Wang et al., 65 deep neural networks extract statistical context features using learnable histograms, which are used as an additional layer to the deep neural network. Simon and Uma 66 use a CNN utilising the first layer of the network to extract the features, using softmax as a classifier. Their approach aims to analyse the efficiency of CNN features in texture classification. In a further study by Hosseini et al., 67 feature responses extracted from Gabor filters are fed into a CNN alongside the input image. They are fed as an input tensor (a stack of the image and Gabor responses, or a fused image: a weighted sum of the image and Gabor responses). Other works use multilinear principal component analysis (MPCA), the hamming distance 68,69 and CNNs to extract features from images utilising transfer learning. 70,71 However, the extracted features come from all the layers except the classification layer. They also use the Euclidean distance to measure the differences between the query and the stored image, 72 and principal component analysis to reduce the number of principal components.

METHODOLOGY
This work presents a novel way of identifying target domain dataset samples closer to source domain features utilising feature extraction via the target pre-trained model. The use of the target pre-trained model as a feature extractor is motivated by the fact that it simulates the actual transfer-learning environment. The extracted features' probability distributions are conflated and compared using DKL. Table 1 lists the main notations used in the proposed approach.

Divergence distance measures
A divergence measure is a function of two probability distributions that gives their difference. Examples of divergence measures include the Kullback-Leibler, Bhattacharya, Jensen-Shannon, Wasserstein and Hellinger distances. These example divergences are compared in this work in order to select DKL. The Hellinger distance measures the difference between any two probability distributions on a shared space of probability distributions. Its values range between 0 and 1, with 0 being the least distance, as pointed out by Greegar and Manohar. 73 However, it has a higher time computational complexity than DKL.
The Jensen-Shannon is viewed as a symmetrised DKL to the average mixture distribution, as explained by Nielsen and Richard et al. 74 However, it has a higher similarity search cost compared to DKL.
The Bhattacharyya distance measures the overlapping degree of any two probability distributions. Its cosine-angle form gives a geometric interpretation of the distributions. However, it is not a true metric since it does not satisfy the triangle inequality. As Erick, Petr and Eva 75 point out, it saturates rapidly and sticks closer to its maximum value than the DKL.
The Wasserstein distance also looks at the difference between any two probability distributions. It is based on the optimal transport theory described by Yossi, Carlo and Leonidas, 76 identifying the optimal transportation and allocation of resources. The distance gives values comparable to DKL but has higher computational complexity.
DKL is a divergence, a non-symmetrical measure that gives the informational difference between any two probability distributions, as pointed out by Nelken, Rani and Stuart. 77 A study by Erick, Petr and Eva 75 showed it performing better than other measures such as histograms and Bhattacharya, except for Wasserstein, which carried an overhead in computational complexity.
For two discrete probability distributions E and V defined on the same support, the DKL is expressed in Equation (1) as

$$D_{KL}(E \,\|\, V) = \sum_{i} E(i)\,\log\frac{E(i)}{V(i)} \tag{1}$$

where DKL(E||V) = 0 if and only if E = V.
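As a concrete illustration of the measures compared above (a sketch only; SciPy and NumPy are assumed, and the two distributions are random placeholders), each divergence can be computed on a pair of discrete distributions as follows.

```python
# Comparing the divergence measures discussed above on two discrete distributions.
import numpy as np
from scipy.stats import entropy, wasserstein_distance
from scipy.spatial.distance import jensenshannon

rng = np.random.default_rng(0)
e = rng.random(64); e /= e.sum()                    # distribution E
v = rng.random(64); v /= v.sum()                    # distribution V

kl = entropy(e, v)                                  # D_KL(E || V), Equation (1)
js = jensenshannon(e, v) ** 2                       # Jensen-Shannon divergence
bc = np.sum(np.sqrt(e * v))                         # Bhattacharyya coefficient
bhattacharyya = -np.log(bc)
hellinger = np.sqrt(1.0 - bc)
wasserstein = wasserstein_distance(np.arange(64), np.arange(64), e, v)

print(kl, js, bhattacharyya, hellinger, wasserstein)
```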
DKL has aided divergence measurement in convolutional neural networks in various studies. Togami et al. 79 use DKL to evaluate the probabilistic output signal in a multi-channel speech-sourced deep neural network. The distributions come from the unsupervised method and the deep neural network supervised signal, while the input signal is generated from the microphone of a teleconferencing system. In a study by Shervan and Farshad, 8 an LBP-based colour-texture classifier is proposed. This classifier addresses the impulse-noise sensitivity of the LBP classifier. The classifier comprises the LBP extractor and the DKL, which ranks the features so that users achieve higher classification accuracy.
DKL is further used by Cao et al. 80 in their proposed use of a BERT (Bidirectional Encoder Representations from Transformers) model to eliminate the difference between the source and target domain features. The DKL is used to show the divergence between the two domains. The BERT model can map the features of both domains in a shared feature space.
Zhuang et al. 81 propose using supervised representation learning on deep transfer auto encoders. They utilise DKL to show the distribution differences between the source and the target domains.

Proposed features conflation approach
The proposed method utilises the concept of the conflation of probability distributions introduced by Hill. 44 The proposed method architecture comprises two parts: feature extraction with matching, and the transfer learning part. The features are extracted using the target pre-trained CNN, comprising its first convolutional layer and an additional pooling layer (using average maximum pooling) to reduce the dimension of the feature maps and hence the computational complexity. The first layer extracts edge-like features and can be considered a filter bank approach, such as Gabor filters or Maximum Response filters, widely used in texture analysis. 51 A CNN is used because it gives invariant, discriminative features.
Once the features have been extracted, the feature vectors are reshaped into one-dimensional tensors for all the feature maps. These tensors are converted into probability distributions using the softmax function with temperature, as expressed in Equations (2) and (3). The softmax function works on the assumptions that:

i. each reshaped tensor element w_i is in the range [0, 1], that is, w_i ∈ [0, 1];
ii. the summation of w_1, …, w_n is 1.

Therefore, the softmax value p(w_i) for element w_i is defined as

$$p(w_i) = \frac{e^{w_i}}{\sum_{j=1}^{n} e^{w_j}} \tag{2}$$

where p(w_i) is the probability density or mass for the elements in the distribution. This approach adds a temperature T to the softmax to regulate each event's relevance and control the entropy in the probability distribution. 82 Equation (2) can then be expressed as shown in Equation (3):

$$p(w_i) = \frac{e^{w_i / T}}{\sum_{j=1}^{n} e^{w_j / T}} \tag{3}$$
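The sketch below illustrates this step under stated assumptions: a Keras VGG16 backbone stands in for the pre-trained model, its first convolutional layer output is average-pooled (the paper describes its pooling only as "average maximum pooling"), and each pooled feature map is flattened and passed through the temperature-scaled softmax of Equation (3). The helper names are illustrative, not the paper's code.

```python
# Sketch: extract first-layer feature maps, flatten them and convert each map
# into a probability distribution with a temperature-scaled softmax (Equation (3)).
import numpy as np
import tensorflow as tf

base = tf.keras.applications.VGG16(weights="imagenet", include_top=False)
feature_model = tf.keras.Model(inputs=base.input,
                               outputs=base.get_layer("block1_conv1").output)
pool = tf.keras.layers.AveragePooling2D(pool_size=4)    # assumed pooling choice

def softmax_with_temperature(w, T=0.5):
    z = (w - w.max()) / T                               # stabilised logits scaled by T
    p = np.exp(z)
    return p / p.sum()

def image_to_distributions(image_batch, T=0.5):
    """image_batch: (1, 224, 224, 3) preprocessed image -> one distribution per feature map."""
    maps = pool(feature_model(image_batch)).numpy()[0]  # (H, W, C) pooled feature maps
    return [softmax_with_temperature(maps[..., c].ravel(), T)
            for c in range(maps.shape[-1])]
```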
These distributions create merged probability distributions for comparing the target and the source. Figure 1 shows a conceptual framework of the proposed approach.
In the comparison phase, the target images are compared with the filtered bank conflated probability distributions of the source images to determine the closest or the most suitable labels indicating their levels of domain data adaptation. The process is repeated until a final dataset is created that can then participate in the transfer learning process.
The proposed method acts as a filter network in the pre-processing data phase of the transfer learning process to improve domain data adaptation.

Steps for the proposed approach

The following steps show the sequence of the proposed approach.
• Step 1: Enter the source image and extract the features as feature map vectors.
• Step 2: Reshape all the feature map vectors, reducing the computational complexity; this converts a feature map vector x into a one-dimensional feature map vector.
• Step 3: Using the softmax function with temperature T, create a standard probability distribution P(x), with each likelihood being the probability mass density. This process forms a probability distribution of the grey-level co-occurrence matrix (GLCM) image feature maps and local binary pattern (LBP) properties.
• Step 4: Conflate the feature map distributions for the source image. Since we are dealing with the same unknown quantity presented in the form of distributions, we can merge the distributions using the conflation-of-distributions method, which minimises the loss of Shannon information as the probability distributions are merged. 56
The inner loop of the source conflation function gives a total complexity of O(nlog(n)); the same complexity is found in the target conflation function. Adding the extra steps in Figure 2C gives an additional complexity of O(nlog(n)). Summing these complexities gives O(nlog(n)) as the total complexity of the algorithm.
The conflation of probability distributions is defined below.

Definition 1. The conflation of discrete probability distributions. Given a finite number of probability distributions P_1, …, P_n, a merged probability distribution Q is expressed in Equation (4) as

$$Q = \&(P_1, \ldots, P_n) \tag{4}$$

The conflation of probability distributions (&) is given by Equation (5) as

$$\&(P_1, \ldots, P_n)(k) = \frac{\prod_{i=1}^{n} f_i(k)}{\sum_{l}\prod_{i=1}^{n} f_i(l)} \tag{5}$$

where f_1, …, f_n are the probability density (mass) functions of the probability distributions P_1, …, P_n, while k and l are the data samples.

Definition 2. The conflation of continuous probability distributions. Given a set of continuous probability distributions P_1, …, P_n, a conflation of probability distributions Q is expressed in Equation (6) as

$$Q = \&(P_1, \ldots, P_n) \tag{6}$$

Equation (6) can be substituted using the probability densities, as shown in Equation (7):

$$\&(P_1, \ldots, P_n)(x) = \frac{\prod_{i=1}^{n} f_i(x)}{\int \prod_{i=1}^{n} f_i(y)\, dy} \tag{7}$$

Definition 3. The conflation of feature vectors' continuous probability distributions. Given an image x_i with dimensionally reduced feature map vectors f_ij with continuous probability distributions P_i1, …, P_im, the conflation of the image is given in Equation (8) as

$$\&(x_i) = \&(P_{i1}, \ldots, P_{im}) \tag{8}$$

For a specific label l_i, with a set of source images X_sl = (x_i1, …, x_in), each with its & probability distribution, a further conflated &l_i can be determined as expressed in Equation (9):

$$\&l_i = \&\bigl(\&(x_{i1}), \ldots, \&(x_{in})\bigr) \tag{9}$$

&l_i can therefore represent a specific label. The other labels can have their own &l_i's representing the classes' features in the source images.
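A minimal sketch of the conflation operation in Definitions 1-3 is given below, assuming NumPy and the illustrative `image_to_distributions` helper from the earlier sketch: the densities are multiplied element-wise (in log space for numerical stability) and renormalised, first per image and then per label.

```python
# Sketch of conflation (Equations (4)-(9)): element-wise product of densities, renormalised.
import numpy as np

def conflate(distributions, eps=1e-12):
    """distributions: sequence of 1-D probability vectors defined on the same support."""
    log_prod = np.sum(np.log(np.asarray(distributions) + eps), axis=0)
    log_prod -= log_prod.max()            # stabilise before exponentiating
    q = np.exp(log_prod)
    return q / q.sum()

# Per-image conflation and per-label conflation (illustrative variable names):
# image_conf = conflate(image_to_distributions(x))                            # one source image
# label_conf = conflate([conflate(image_to_distributions(x)) for x in class_images])
```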
When a target image is input to the pre-trained network, the features are extracted and conflated as in Definition 3, forming a & of the sample target image as shown in Equation (10):

$$\&(x_t) = \&(P_{t1}, \ldots, P_{tm}) \tag{10}$$

where t indicates the target sample. The determined & for the target sample is then compared, using DKL as shown in Equation (11), with the &s determined for the source classes:

$$D_{KL}\bigl(\&(x_t) \,\|\, \&l_i\bigr) = \sum_{j} \&(x_t)(j)\,\log\frac{\&(x_t)(j)}{\&l_i(j)} \tag{11}$$
where DKL(E||V) = 0 if and only if E = V. The finite set of DKL values K is expressed in Equation (12) as

$$K = \{D_{KL_1}, D_{KL_2}, \ldots, D_{KL_N}\} \tag{12}$$

The mean value of the images' divergence is then determined as expressed in Equation (13):

$$\bar{x} = \frac{1}{N}\sum_{i=1}^{N} D_{KL_i} \tag{13}$$

Finally, all the images whose divergence value is lower than x̄ are reserved as part of the dataset for use in the transfer learning process with the pre-trained model, as illustrated in Figures 1 and 3.
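The selection step can then be sketched as follows. This is an illustration under the assumption that each target sample is scored by its divergence to the closest source-label conflation; `entropy` from SciPy computes the DKL of Equation (1), and the function and variable names are hypothetical.

```python
# Sketch of sample selection (Equations (11)-(13)): score each target image by its
# D_KL to the source-label conflations, then keep samples below the mean divergence.
import numpy as np
from scipy.stats import entropy

def select_adaptive_samples(target_confs, source_label_confs):
    """target_confs: list of conflated target distributions;
       source_label_confs: dict label -> conflated source distribution."""
    dkl = np.array([min(entropy(t, s) for s in source_label_confs.values())
                    for t in target_confs])                  # closest-label divergence per sample
    threshold = dkl.mean()                                   # Equation (13)
    below = [i for i, d in enumerate(dkl) if d < threshold]  # retained for transfer learning
    above = [i for i, d in enumerate(dkl) if d >= threshold]
    return below, above
```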

Datasets
The experiments use two categories of datasets: the source dataset (ImageNet, on which the pre-trained models are trained) and the target datasets outlined in Table 2.

Pre-trained model architectures
The experiments are performed on two architectures: VGG16 and MobileNetV2. VGG16: This model belongs to the family of VGG models, which are built on an analysis of increasing the depth of the network with 3 × 3 filters. The input images are 224 × 224 pixels and pass through a stack of convolutional layers.
MobileNetV2: Sandler et al. 83 introduced this model using depthwise separable convolution layers and pointwise convolutional layers instead of the regular convolutional ones. The architecture comprises 53 layers and takes input sizes of 224 × 224 pixels. It has 3.5 million parameters.

Experimental setup
The experiments are implemented using TensorFlow 2.4.1 and trained on the Paperspace cloud platform (Quadro P4000 GPU, 8 CPUs, 30 GB RAM).

Feature extraction parameter settings
The pre-trained model's first convolutional layer is used to extract the features for use with the GLCM and LBP descriptors. The images are converted to grayscale and resized to 224 × 224 for suitability during training. The resizing is also done at the extraction phase to ensure the images conform to the final environment of the pre-trained model. The conversion of the feature distributions to probability distributions uses the softmax function with a temperature of 0.5.

Training parameter settings
The pre-trained models are trained for 50-150 epochs with shuffled mini-batches of 8 images. Stochastic gradient descent (SGD) is used as the optimiser with a learning rate of 0.001, and categorical cross-entropy as the loss function. Other parameters added on top of the network layers are a flattening layer, a dropout layer with a probability of 0.5 and a batch normalisation layer, with softmax as the classifier. A dense layer with 4096 neurons is used in VGG16, while a dense layer with 64 neurons is used in MobileNetV2: fewer neurons in the VGG16 dense layer were shown to reduce accuracy. Batch normalisation normalises the set of activations in a layer and is a pixel-value standardisation technique, as noted by Sergey and Christian. 84 The use of dropout as a regularisation element ensures that all the nodes in the layers have an equal chance of training the model by randomly zeroing out chosen neurons, as pointed out by Nitish et al., 85 to avoid leaving the process to a few heavily weighted nodes that could dominate it.
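A minimal sketch of this training configuration is shown below, assuming a Keras VGG16 backbone and one-hot labels; the exact ordering of the added layers and the dataset pipeline (`train_ds`) are illustrative assumptions, not the paper's released code.

```python
# Sketch of the fine-tuning head and training settings described above (VGG16 variant).
import tensorflow as tf

NUM_CLASSES = 120                                        # e.g. Stanford Dogs 120

base = tf.keras.applications.VGG16(weights="imagenet", include_top=False,
                                   input_shape=(224, 224, 3))
model = tf.keras.Sequential([
    base,
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(4096, activation="relu"),      # 64 neurons for MobileNetV2
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.001),
              loss="categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(train_ds.shuffle(1024).batch(8), epochs=50)  # 50-150 epochs, mini-batches of 8
```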

Textural feature analysis methods
Many methods exploit the first-order and second-order properties of grayscale and colour levels. The first-order properties include the mean, variance and other properties and are derived directly from the individual pixels without cross-comparisons. The second-order properties involve comparing two pixels at the same time. These second-order properties, therefore, investigate how one pixel at some reference location relates statistically to another pixel at a location displaced from the reference location. The experiments use two methods based on the second-order statistical properties to analyse the extracted features: Grey-Level Co-occurrence Matrix (GLCM) and Local Binary Pattern (LBP). The GLCM has recorded better results for a simple situation, as reported by Maillard. 86 Furthermore, it is easy to implement and has been shown to give outstanding results in large fields of applications 87 and good performance (processing time and complexity) when processing document images. 88 The LBP can combine the statistical and structural methods giving it improved analysis performance; it is easy to implement, has a low computational cost and is invariant to monotonic illumination changes.

Grey-level co-occurrence matrix

This matrix represents the different combinations of the pixels' brightness levels or grey levels in an image. It is also called the Spatial grey level dependence matrix (SGLDM). 13 According to Stefania et al., 89 GLCM is a tool for obtaining second-order statistics. It was proposed in 1973 by Haralick. 3 According to Lan and Liu, 90 GLCM is a good feature descriptor since it obtains statistics reflecting the domain knowledge (spatial shape attributes) on grey direction, interval and change amplitude. It has been widely used in motif recognition, segmentation, biometrics, image retrieval, and behavioural analysis. The initial 14 features proposed by Haralick and Shanmugan 3 are grouped into texture visual, correlation, entropy, and statistical measures. Distance and orientation angle are the most important factors to consider when calculating the GLCM of an image. 51 The relationship between the pixels is looked into from the distance and the orientation angle perspectives. 91 Four GLCM properties are explored in this context: correlation, energy, dissimilarity and homogeneity; a short computational sketch follows the list below. The spatial relationship between the pixels' grey levels and their statistical co-occurrences form the texture description. 51
• Energy/uniformity - this measures the intensity of the grey level in the image. It returns the sum of the squared elements in the GLCM.
• Homogeneity/inverse difference moment -refers to the inverse of the image contrast and measures the closeness of pixels' distribution in the GLCM to the GLCM diagonal.
• Correlation-this measures how a pixel is correlated (linear dependencies in the image) to its neighbour over the entire image.
• Dissimilarity -measures the distance between any two pixels in a region of interest or the local variations in an image.
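As referenced above, the four properties can be computed directly from a grayscale image or a quantised feature map. The sketch below is illustrative only; scikit-image is assumed, and the distance and angle settings are placeholders reflecting the factors mentioned earlier.

```python
# Sketch: compute the four GLCM properties used in this work over one pixel
# distance and four orientation angles, averaging across them.
import numpy as np
from skimage.feature import graycomatrix, graycoprops

def glcm_properties(img8, distances=(1,), angles=(0, np.pi / 4, np.pi / 2, 3 * np.pi / 4)):
    """img8: 2-D uint8 array (e.g. a grayscale image or quantised feature map)."""
    glcm = graycomatrix(img8, distances=list(distances), angles=list(angles),
                        levels=256, symmetric=True, normed=True)
    return {p: graycoprops(glcm, p).mean()               # average over distances and angles
            for p in ("correlation", "energy", "dissimilarity", "homogeneity")}
```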

Local binary pattern
This textural descriptor (visual descriptor) compares a pixel's grey level with its neighbours, assigning it a binary number. 66 It describes the local texture patterns in an image 24 using 3 × 3 blocks (though this has been extended to different sizes 50 ) with the centre pixel as the threshold for the neighbouring pixels. 51 The LBP shows the correlation of the pixels within the block (local area) and is powerful enough to detect all the edges in an image. 48 A histogram is used as a feature vector or textural descriptor describing the signal via the distribution of the LBPs. With LBP, it is possible to encode the geometric features by detecting the edges, corners and other areas of the image. This encoding gives the feature vector representation, and LBP has proved to be an excellent unsupervised learning method. According to Zeebaree et al., 92 LBP is a widely used textural feature descriptor in texture detection and edge detection. According to Shekhar and Manoj, 93 LBP is one of the most heavily utilised feature descriptors in different applications, for example, in bearing fault diagnosis, 94 transformer defect detection, 95 brain tumour classification, 96 and emotion recognition. 97 According to Ojala, 98,99 it is widely used in many computer vision applications due to its computational efficiency in describing local texture structures, speed of computation, robustness to illumination variations and simplicity. 89 It is computationally efficient, withstanding monotonic illumination fluctuations, as noted by Ling and Liu. 100 LBP also improves feature extraction when used with CNNs, as noted by Tan et al. 101 In this study, the LBP features capture good information related to image edges. Furthermore, this study's use of uniform LBP ensures reduced overlap of textural features between classes.

Figure 4. LBP decimal value extraction on an image.
In this work, the LBP uniform method is used to acquire the exact probability density function (PDF) regardless of the orientation. Otherwise, the comparison is compromised by having different feature inputs. According to Kaplan et al., 96 the uniform patterns from LBP descriptors show and explain simple features such as spots, edges and corners. Niu et al. 97 point out that uniform-pattern LBP contains primitive structural information for edges and corners. These are the same features that can be given by methods such as Gabor filters. A simple example of LBP in use is illustrated in Figure 4.
Once the vector representation of the features is defined, a comparison with another image is made, representing the identified features using histograms.
In using LBP, the image needs to be simplified (converted to grayscale, reducing its dimensionality). This process focuses on the local differences in the image (feature) luminance and binarisation. 65

LBP process steps
• Step 1: For each image pixel (x, y), choose some neighbouring pixels (P) at a radius R.
• Step 2: Calculate the intensity difference of the selected pixel (x, y) to its neighbours P.
• Step 3: Threshold the intensities: assign 0 to neighbours with a smaller value than the centre; assign 1 to neighbours equal to or greater than the centre. These values form a bit vector.
• Step 4: Convert the bit vector into a decimal value (0-255) that replaces the centre pixel's (x, y) value. The values to be decimal-converted are read clockwise.
Equation (14) expresses the LBP descriptor for any image pixel:

$$LBP_{P,R} = \sum_{p=0}^{P-1} s(g_p - g_c)\,2^{p}, \qquad s(x) = \begin{cases} 1, & x \geq 0 \\ 0, & x < 0 \end{cases} \tag{14}$$

where g_c and g_p represent the intensities of the centre pixel and the neighbouring pixels, P is the number of neighbouring pixels and R is the radius of the neighbourhood around the centre pixel.
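Under the same illustrative assumptions as the earlier sketches (scikit-image available; P = 8 neighbours at radius R = 1 as placeholders), the uniform-LBP descriptor of Equation (14) and its histogram representation can be sketched as follows.

```python
# Sketch: uniform LBP codes (Equation (14)) histogrammed into a textural descriptor.
import numpy as np
from skimage.feature import local_binary_pattern

def uniform_lbp_histogram(gray_img, P=8, R=1):
    """gray_img: 2-D grayscale array. Returns a normalised histogram of uniform LBP codes."""
    codes = local_binary_pattern(gray_img, P, R, method="uniform")   # P + 2 uniform labels
    hist, _ = np.histogram(codes, bins=np.arange(P + 3), density=True)
    return hist
```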
Liu et al. 102 propose six LBP classes based on their roles in feature extraction: traditional LBP, neighbourhood topology and sampling, thresholding and quantisation, encoding and regrouping, combining complementary features, and methods inspired by LBP. However, other variants exist, as reviewed by Nanni. 103

RESULTS AND DISCUSSIONS
This section presents and discusses the results of the performed experiments. The results are presented in four parts: part one compares three performance measures (accuracy, precision and recall) on both LBP and GLCM methods. Part two compares the performance to transfer learning baselines. Part three compares the performance of the probability distance measures, while the last part compares the computational complexities of the probability distance measures.

Performance measures on GLCM and LBP properties
In order to compare the performance of the textural analysis methods on the selected target dataset samples, three performance measures are considered: accuracy, precision and recall.

Accuracy comparison
Accuracy is the fraction of the correct predictions of a model. 104 It is expressed in Equation (15) as

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{15}$$

where TP refers to true positive, TN to true negative, FP to false positive, and FN to false negative values. Tables 3 and 4 show the accuracy performance of the methods on the respective datasets. As seen in both Tables, the samples below the mean DKL threshold (BLW) give better accuracy than those greater than the mean (ABV). This performance can be traced to the plots in Figure 5. Table 3 shows that Caltech 256 gives good accuracy on the GLCM correlation property, with the lowest value on the dissimilarity property. The same trend of high accuracy is repeated for the MIT Indoor and Stanford Dogs 120 datasets, with the dissimilarity property giving lower accuracies. The accuracy of the LBP properties is lower than that of GLCM in Caltech 256 and better in the Stanford Dogs 120 dataset.
Correlation, homogeneity and energy give the highest accuracies in the MobileNetV2 pre-trained model. The high correlation values in the Caltech 256 dataset are seen in this model, similar to the VGG16 model in Table 3. The LBP performs better in MobileNetV2 for Caltech 256, and GLCM properties outperform LBP for the case of MIT Indoor and Stanford Dogs 120, as seen in Table 4.

Precision comparison

Precision is the fraction of positive predictions that are correct. It is expressed in Equation (16) as

$$\text{Precision} = \frac{TP}{TP + FP} \tag{16}$$

where TP refers to true positive and FP refers to false positive values. Tables 5 and 6 show the precision performance of the methods on the respective datasets. As with accuracy, the samples with lower divergence values give better precision, as seen in the Tables and Figure 6.
In Table 5, correlation and energy GLCM properties gave the best precision for all the datasets. The dissimilarity property gave the least precision values, as noted for Caltech 256 and Stanford Dogs 120. It is also noted that GLCM properties gave better precision to LBP property precision values for the MIT Indoor and Caltech 256 datasets.
In the three datasets in Table 6, the energy GLCM property gave the best precision values. LBP also gave high precision values in the MIT Indoor dataset. GLCM properties gave better precision in the Caltech 256 and Stanford Dogs 120 datasets.

Recall comparison

Recall is the fraction of actual positives that are correctly identified. It is expressed in Equation (17) as

$$\text{Recall} = \frac{TP}{TP + FN} \tag{17}$$

where TP refers to true positives and FN refers to false negative values. Tables 7 and 8 show the recall performance of the methods on the respective datasets. The recall for the lower-than-mean values still gives better results, as shown in Figure 7.
Correlation and energy give the highest recall values in the three datasets, with dissimilarity giving the least recall value. The GLCM properties also performed better than the LBP for the three datasets. In Table 8, the GLCM energy and correlation give better recall values than the other GLCM properties. The LBP also performs better for the Caltech 256 and MIT Indoor, while GLCM performs better in the Stanford Dogs 120 dataset.

Comparison of proposed method with baselines
In order to observe the improvements from the proposed method, two baselines are used in this study: (a) comparison between lower DKL-valued samples to all the available data samples and (b) standard transfer learning baseline methods.

Comparison between low DKL samples and all samples
All the available samples would be utilised in a typical transfer learning scenario. We can compare the performance between the selected low DKL samples and the available samples to observe the performance difference. We compare this behaviour using the LBP measure, as shown in Figure 8A,B below. Further performance showing the superiority of adopting lower DKL-valued samples is observed in Tables 9 and 10, where lower DKL-valued samples performed better than using all the samples in the target task.

Performance with the transfer learning baselines

Three datasets (below DKL, above DKL and all samples) were used with the following fine-tuning techniques:
a. Standard fine-tuning: this baseline uses all the parameters of the pre-trained model on the three datasets.
b. Fine-tuning last-k: these baselines fine-tune the last k layer(s), where k can be 1, 2 or 3 layers of the pre-trained model, on the datasets.
c. Feature extractor: this baseline uses the pre-trained model as a feature extractor, only adding a classification layer for the datasets.
Table 9 shows the performance of the baselines using the LBP measure on the MIT Indoor dataset for 50 epochs. Table 10 shows the baseline accuracy performance of Stanford Dogs 120 using GLCM homogeneity. Tables 9 and 10 show that the samples with lower DKL performed better in all the techniques, with standard fine-tuning giving the best accuracies. Ge and Yu 50,51 note this observation in their joint fine-tuning approach, showing the importance of similar low-level characteristics identified in both domains, as illustrated in Figure 9. The feature extractor technique gave the least accuracy.

Probability distance measures comparisons of textural analysis properties
This study compares five divergence distance measures: Kullback-Leibler, Hellinger, Wasserstein, Jensen-Shannon and Bhattacharya distance. The DKL and Wasserstein give the best overall results across the three metrics, as shown in Tables 9 and 10, while Hellinger gives the worst performance, as evident in Figure 10A-C. As seen in Table 11, Wasserstein gave an excellent performance for the three measures compared to the other divergence measures. Although the Bhattacharya did not perform well for this dataset on MobileNetV2, DKL gave the second-best overall performance. Bhattacharya and Hellinger could have performed better for this pre-trained model.
As noted in Table 12, the Bhattacharya gave the best accuracy and recall values, followed by DKL, for the Stanford Dogs 120 given the LBP values. However, DKL gave the best precision value on the VGG16 model. Overall, DKL performs better, while Hellinger still performs the least. Tables 13 and 14 show the time (seconds) and memory (bits) taken by each divergence measure when comparing two samples from the source and target distributions. Tables 13 and 14 show that Bhattacharya gives the highest time complexity when comparing a target sample to the source domain dataset, followed by Jensen-Shannon. The least complexity is found in DKL.

Probability distance measures computational complexity
A similar trend of higher time computational complexity is noted when using the GLCM properties, where Bhattacharya takes the most time comparing the probability distribution of source and target samples. The DKL takes the least time.

DISCUSSION
Correlation and homogeneity give the best accuracies for the two pre-trained models utilising the GLCM and LBP textural descriptors in Tables 3 and 4. The homogeneity contribution to good accuracy shows that many pairs of neighbouring pixels have similar grey levels and that the GLCM elements lie along the diagonal, with accuracy differences ranging between 1.96% and 6.18%. Homogeneity increases where the density of the edges is low and where the distance between the textural patches is large. The reported accuracy shows that large regions of the samples below the mean DKL bear the same values (many pairs of adjacent pixels have the same value). These values are found on the main diagonal and seem to change smoothly, as noted by Chaves, 106 while those of lesser importance are far away from the main diagonal. The correlation property seen in Tables 3, 4, 7 and 8 on accuracy and recall helps in determining better sensitivity, effectively representing the accuracy of the classifier. 41 The GLCM gave better precision values than LBP, with correlation and energy standing out in precision, as shown in Tables 5 and 6, with difference margins between 0.12% and 9.59%. According to Park et al., 107 the GLCM energy property goes hand in hand with homogeneity, which looks at the texture's uniformity and simplicity. GLCM still performed better in recall, with energy and correlation giving good values, as shown in Tables 7 and 8. The recall metric is essential; it helps find all the relevant data points and reflects the model's ability to retrieve all the relevant points in the dataset. 108 Other studies have also reported the excellent performance noted with the GLCM properties: Nurhaida et al. 109 and Zou et al. 110 Good performance of GLCM features indicates less discriminating features in the image grey levels, and GLCM can perform better than CNN-based descriptors. 111 In a study by Tan et al., 102 GLCM properties were shown to improve the performance of the CNN model, especially in the case of a limited dataset, a common scenario in transfer learning.
The Wasserstein and Bhattacharya divergence measures give good accuracies compared to the DKL in Tables 11 and 12. However, the two also gave higher computational complexity than the DKL for a single sample, as noted in Tables 13 and 14, with Bhattacharya giving an average overhead of 0.192 milliseconds for VGG16 LBP and 0.244 milliseconds for MobileNetV2 GLCM properties. With many samples, the complexity would likely increase many-fold, forming the basis for selecting DKL in this study. Also, as explained in the proposed approach (Figure 2), the complexity of the proposed algorithm is O(nlog(n)), which could be considered high. For example, in the case of a Stanford Dogs 120 image, it takes 4.906 milliseconds to go through the conflation process for the VGG16 model; adding the 0.192 milliseconds used in Figure 2C totals 5.098 milliseconds for one image to be added to the final dataset, calculated using the Python timeit library. However, the proposed approach's applicability is subject to factors such as hardware in cases of large inputs, which would highly affect these pre-processing stages of fine-tuning. 112 As Naveed et al. 113 noted, the GLCM helps the pre-trained model learn better patterns from the data through the four descriptors. This description allows the selection of adapting data points that yield good performance. The selected data points have a lower informational difference in the pixels compared to the samples with DKL above the mean, for both GLCM and LBP descriptors. From the tabular comparisons presented in this work, conflation has been used to identify samples of similar textural characteristics, providing better samples that adapt more easily and faster to a selected pre-trained model.
It is also noted that the GLCM dissimilarity property gave lower values in the three performance measures. The dissimilarity property indicates sharp changes in the grey levels: when the textural features change across many pixels, there is high contrast or dissimilarity between the pixels, the dissimilarity measure is high, and the features change abruptly.
In comparing the proposed approach to the existing baselines, an improvement of between 2.51% and 9.15% is noted between the proposed standard transfer learning and other baselines based on the proposed approach. The feature extractor technique gives the least accuracy performance, possibly because it involves freezing parameters, which could lead to domain shift and hence the poor performance.

CONCLUSION AND FUTURE DIRECTION
In this paper, we address target data adaptation in transfer learning by introducing the conflation of domain data through textural features. The method uses a pre-trained model to extract the textural features, leveraging the actual deployment model, and GLCM and LBP for feature description, with tests on VGG16 and MobileNetV2 using the Stanford Dogs 120, Caltech 256 and MIT Indoor datasets. Experiments show that the samples with DKL values below the mean perform better than those with higher DKL values and still give state-of-the-art results. This performance shows that the quality of transfer learning samples dramatically affects how the selected models perform. Further studies can improve the proposed approach by incorporating the following:
a. Use an optimal similarity graph to capture intrinsic local features or use the discrete cosine transform (DCT) to address the curse of feature dimensionality.
b. Use a reinforcement learning attention mechanism to extract salient textural feature patches and temporal features to prevent information loss in the convolutional layers during feature extraction and selection.
c. Investigate the behaviour of other pre-trained models and datasets to improve the identification of quality and adaptive target samples.