FUSION OF DEEP LEARNING ARCHITECTURES FOR ENHANCED TARGET RECOGNITION ON SAR IMAGES

ABSTRACT
In various applications of radar imagery, one of the fundamental problems is mainly linked to the analysis and interpretation of the images provided, in particular the recognition of moving and/or fixed targets. This task has become more difficult due to the large volume of radar data, which has led to the use of automatic processing and target-recognition methods. The aim of this study is to explore data fusion in SAR (Synthetic Aperture Radar) image classifiers. To this end, we propose a new approach to combine three CNN (Convolutional Neural Network) architectures with several fusion rules. First, we perform a training process of three deep learning architectures; namely, the basic CNN, the Xception and the AlexNet architectures. Then, two fusion techniques are proposed. The first one deals with the majority rule and the second uses a neural network to combine the decision outputs obtained from the three elementary classifiers to achieve the final decision. To evaluate and validate the proposed approach, the MSTAR (Moving and Stationary Target Acquisition and Recognition) dataset is used. The obtained performances of the fusion techniques improve the recognition rate, with a final accuracy of 99.53% for the majority rule and 99.44% for the neural network-based rule, which surpasses the accuracy of each individual CNN.


INTRODUCTION
One of the fundamental problems in radar imaging is mainly related to the analysis and interpretation of images acquired in various applications, including recognition of moving and fixed targets. This task becomes more difficult to achieve when automatic processing methods and decision-making are concerned, in regard to the large volume of radar data and speckle measurements [1]-[6]. Owing to its global coverage as well as its weather independence, SAR (Synthetic Aperture Radar) imagery has great potential for military and/or civilian surveillance. In order to leverage this potential, ML (Machine Learning) can be used to automatically process large amounts of data for different goals, such as ATR (Automatic Target Recognition). For instance, target recognition in radar images presents an essential task for monitoring and surveillance of sensitive areas, such as military and/or civilian zones.
Several methods for target recognition and classification from radar images have been developed in the literature. Due to the absence of efficient feature representation, the interpretation and understanding of radar images make their classification a significantly challenging task. An adequate feature-extraction approach, which can abstract spatial information from radar images and improve classification accuracy, is required. Feature extraction and target recognition in images have been of great interest for many years and many methods have been proposed [1], [7]. These include classical classification methods, such as Bayesian methods, SVM (Support Vector Machine) [8], AdaBoost (Adaptive Boosting) [9], decision trees, WSC (Weighted Sparse Classification), etc. [10]-[11].
Although these methods perform well in some situations, their performance degrades significantly in other situations and conditions. This has led researchers to adopt other, more sophisticated approaches, such as ANNs (Artificial Neural Networks), which have a very efficient learning capability in a multitude of system-modeling problems [1]. In recent years, we have seen the emergence of DL (Deep Learning) methods with several variants. The latter have given very satisfactory results in several application areas, such as cybersecurity, text analysis, visual and image recognition [12]-[16] and many more. Several research studies have explored the use of RNNs (Recurrent Neural Networks) in the context of SAR image classification, with particular emphasis on LSTM (Long Short-Term Memory) networks. LSTMs are specifically designed to handle sequences of data or time series, a feature particularly relevant to the sequential nature inherent in SAR data [17]. To further enhance SAR classification performance, other researchers have delved into hybrid architectures that combine both CNNs (Convolutional Neural Networks) and RNNs. These hybrid approaches showed promising results in SAR image classification by enabling the simultaneous capture of spatial features and temporal dependencies [18]-[19]. With the constant evolution of deep-learning network architectures and their maturity, many researchers have turned to the use of pre-trained DNNs (Deep Neural Networks) in the context of SAR image classification. This approach explores how features pre-learned by these networks can be judiciously employed to significantly increase the accuracy of SAR image classification [20].
Other researchers have attempted to expand the scope of the training dataset by implementing various operations on the images. These operations include rotations, translations, changes in scale and the introduction of noise [21]-[23]. Furthermore, several super-resolution approaches based on DCNNs (Deep Convolutional Neural Networks) have been widely adopted to enhance image resolution. This increase in resolution has directly contributed to an improvement in classification accuracy [24]. Additionally, there are studies that have employed data-augmentation techniques utilizing GANs (Generative Adversarial Networks) to generate synthetic images [25]. Furthermore, in recent years, advanced approaches have emerged, utilizing data fusion to significantly enhance SAR image classification and achieve more robust and accurate results. Among these methods, data fusion from multiple sensors has demonstrated a clear improvement compared to using data from a single sensor [26]. Another promising approach lies in the use of model ensembles for SAR image classification, which explores how the aggregation of multiple models can substantially increase classification accuracy. Commonly adopted methods include ensemble learning, where multiple classifiers are trained and their predictions are combined using various techniques, including voting, weighting and stacking methods [27]-[28].
In this context, the main objective of this research work is to explore new combined DL architectures to accomplish the classification task, using two methods. The first method relies on fusing the output data from each classifier to make a final decision using the majority fusion rule. The second involves the use of ANNs (Artificial Neural Networks) to build a data-fusion model, thus exploiting the advantages of each DLN (DL Network) architecture used in this study. Both methods will be evaluated and validated on the MSTAR (Moving and Stationary Target Acquisition and Recognition) dataset [29].
The contribution of this work is the study of two fusion techniques, the first based on the majority rule and the second on an NN-based rule, in order to improve the recognition performance.
The remainder of this paper is organized as follows. In Section 2, we introduce CNNs and their variants. In Section 3, we present the proposed method. In Section 4, we detail the training and describe and discuss some experimental results performed on the MSTAR dataset. In Section 5, we conclude our work and give some future perspectives.

CONVOLUTIONAL NEURAL NETWORKS
CNNs are to date the most efficient models for classifying images, particularly radar images. An input image is provided in the form of a matrix of pixels: a single two-dimensional array per channel/layer for a SAR image. For a multi-layer image, such as a color image (3 layers for Red, Green and Blue) or a multi-spectral image, the input image is provided as a multi-dimensional array.
The first part of a CNN is the convolutive step, which operates as an image-feature extractor. The image is passed through a series of filters, or convolution kernels, creating new images called convolution maps. Some intermediate filters reduce the size of the input data by a maximum pooling operation. Finally, the last convolution maps are laid flat and concatenated into a feature vector, called the CNN code. This CNN code is the input of a layer called the FC (Fully Connected) layer, which is a multi-layer perceptron. Its role is to combine the characteristics of the CNN code to classify the image. The output is the last layer, called the SoftMax layer, which uses a SoftMax function as an activation function [5]. In this work, we consider three CNN architectures: the basic CNN, AlexNet and Xception.

Basic CNN
The proposed basic CNN architecture is illustrated in Figure 1. It is formed by successive convolution and subsampling layers. These layers are followed by an FC layer. To optimize the training process, transfer learning is adopted from the classification of ImageNet as a source domain. Transfer learning is commonly used in learning applications in order to optimize the learning process and improve the performance of recognition tasks. For common public architectures, a pre-trained network is used as a starting learning point for a destination task. Thus, the number of classes at the Softmax layer of the pre-trained CNN network is adjusted.
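For illustration, the following is a minimal Keras sketch of such a basic CNN for 158×158 single-channel SAR chips; the layer counts and filter sizes are assumptions for the sketch, not the optimal hyper-parameters reported later in Tables 2 and 3.

```python
# A minimal sketch of a basic CNN for SAR chips (illustrative layer sizes).
import tensorflow as tf

def build_basic_cnn(input_shape=(158, 158, 1), num_classes=10):
    return tf.keras.Sequential([
        tf.keras.layers.Input(shape=input_shape),
        # Convolution + subsampling pairs act as the feature extractor.
        tf.keras.layers.Conv2D(16, 5, activation="relu"),
        tf.keras.layers.MaxPooling2D(2),
        tf.keras.layers.Conv2D(32, 5, activation="relu"),
        tf.keras.layers.MaxPooling2D(2),
        # The flattened maps form the "CNN code" fed to the FC part.
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(128, activation="relu"),
        # Softmax output sized to the 10 MSTAR classes.
        tf.keras.layers.Dense(num_classes, activation="softmax"),
    ])

model = build_basic_cnn()
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
```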

AlexNet
AlexNet is a pivotal pre-trained CNN model introduced by Alex Krizhevsky in 2012 [30]. It played a crucial role in advancing computer vision by learning hierarchical representations from 224x224 pixel RGB images. This deep architecture includes convolutional and max-pooling layers for feature extraction, starting from basic patterns and progressing to higher-level abstractions. Extracted features are flattened and processed through three FC layers to facilitate classification, with the final layer having units corresponding to dataset classes and using Softmax activation for probability outputs. In this study, we adapt the structure of the pre-trained AlexNet network, so that it can classify SAR images composed of 10 classes. We therefore act on the FC part to keep only 10 neurons in the output layer.

Xception
Xception, derived from "Extreme Inception", is a DCNN (Deep CNN) architecture inspired by Inception. It excels at capturing features of different sizes by isolating the learning of spatial and channel-wise features [31]. Instead of combining these two types of learning in a single convolution, Xception uses depthwise separable convolutions: it starts with spatial convolutions on each channel and then integrates information across channels using 1x1 convolutions. This design optimizes efficiency while maintaining depth, enabling Xception to capture both local and global features efficiently. It ends with FC layers for feature relationships and uses Softmax activation for multi-class classification, making it highly efficient and accurate for computer-vision tasks like image classification. That is why, in our study, we adapt the output layer to align with the number of classes in the MSTAR dataset.
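As a concrete illustration of this idea, the following Keras sketch contrasts a standard convolution with its depthwise separable counterpart; the tensor sizes are arbitrary assumptions.

```python
# Standard vs. depthwise separable convolution, as used throughout Xception.
import tensorflow as tf

x = tf.keras.layers.Input(shape=(64, 64, 32))

# Standard convolution: spatial and cross-channel mixing in one step.
standard = tf.keras.layers.Conv2D(64, 3, padding="same")(x)

# Depthwise separable: per-channel spatial filtering, then 1x1 channel mixing.
depthwise = tf.keras.layers.DepthwiseConv2D(3, padding="same")(x)
separable = tf.keras.layers.Conv2D(64, 1, padding="same")(depthwise)

# Keras also provides the fused layer Xception is built from:
fused = tf.keras.layers.SeparableConv2D(64, 3, padding="same")(x)
```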

DATA FUSION
Data fusion is used to improve the results by taking advantage of the complementarity and redundancy of elementary information, in order to obtain information that is as reliable and precise as possible in all observation conditions, as shown in Figure 2. For the data-fusion rule of the different classifiers, the simplest and most popular approach is the majority-vote rule [26]. We will also explore the possibility of applying NNs for the fusion of data from the different classifiers. It should be noted that, unlike the majority rule, which aggregates the different elementary decisions to reach the final decision, the NNs are instead fed by the distinct outputs of the fully connected layer of each classifier. These outputs represent the results of the Softmax function, effectively providing us with the probabilities for each class. This approach is employed to construct an effective fusion rule, leveraging the capability of NNs to model any rule through learning.

Majority Vote
The majority vote consists in choosing the decision taken by the maximum number of methods. Given $m$ methods $\{e_1, e_2, \ldots, e_m\}$, where method $e_k$, $k \in \{1, \ldots, m\}$, assigns the decision $D_i$ to the observation $x$, noted $e_k(x) = D_i$, with $i$ the class number, an indicator function $T_{i,k}$ is associated to each method, such as:

$$T_{i,k}(x) = \begin{cases} 1, & \text{if } e_k(x) = D_i \\ 0, & \text{otherwise.} \end{cases}$$

The data fusion from the $m$ classifiers is carried out according to the following expression:

$$V_i(x) = \sum_{k=1}^{m} T_{i,k}(x).$$

The majority-vote rule consists, therefore, in choosing the class that maximizes $V_i$. The correct class is the one that is most often chosen by the majority of the methods.
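A minimal Python sketch of this rule for the three classifiers used here; the random tie-break for a three-way disagreement follows the behavior described in the results section.

```python
# Majority-vote fusion for m classifiers; ties are broken at random.
import numpy as np

def majority_vote(decisions, num_classes=10, rng=np.random.default_rng()):
    """decisions: array of shape (m,) holding each classifier's class index."""
    votes = np.bincount(decisions, minlength=num_classes)  # V_i = sum_k T_ik
    winners = np.flatnonzero(votes == votes.max())
    # With three classifiers, the winner is unique unless all three disagree;
    # in that case one of the tied classes is chosen at random.
    return rng.choice(winners)

print(majority_vote(np.array([2, 2, 7])))  # -> 2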

Rule Based on NNs
Exploring the possibilities offered by NNs for constructing classification rules through learning is a central challenge in the field of image classification. In this study, our focus lies in the classification of SAR images using three distinct approaches. The first involves a basic CNN classifier, while the other two leverage pre-trained architectures, namely Xception and AlexNet, renowned for their ability to capture high-level features. A critical step in this methodology involves the use of the Softmax output from each classifier to feed an MLP (Multi-Layer Perceptron) NN. This approach skillfully combines the strengths of each classifier while minimizing potential errors. The Softmax function takes as input a vector of scores $S$, where each element $S_i$ represents the degree of belongingness of an example to a particular class. The output of the Softmax function, denoted as $P$, is a vector of the same dimension as $S$, but each element $P_i$ represents the probability of the example belonging to class $i$. The mathematical formula for the Softmax function applied to a score vector $S$ is as follows:

$$P_i = \frac{e^{S_i}}{\sum_{j=1}^{M} e^{S_j}},$$

where $P_i$ is the probability that the example belongs to class $i$, $S_i$ the score associated with class $i$ and $M$ the total number of classes.
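This formula can be transcribed directly; the following numpy sketch adds the usual max-subtraction for numerical stability, which leaves the result unchanged.

```python
# Direct transcription of the Softmax formula above.
import numpy as np

def softmax(scores):
    exp = np.exp(scores - scores.max())  # stable: shift by the max score
    return exp / exp.sum()

p = softmax(np.array([2.0, 1.0, 0.1]))
print(p, p.sum())  # class probabilities summing to 1
```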
In our approach, we have chosen to utilize the Softmax outputs generated by the classifiers (Basic, AlexNet, Xception) in place of class labels. This decision was made to provide the MLP network with richer and more informative data. Consequently, our training dataset comprises input-output vectors. The input vector is composed of the Softmax outputs from each classifier:

$$X = [P_{1,1}, \ldots, P_{1,10}, P_{2,1}, \ldots, P_{2,10}, P_{3,1}, \ldots, P_{3,10}],$$

where $P_{i,j}$ is the output of the Softmax function of classifier $i$ for class $j$, with $i$ ranging from 1 to 3 and $j$ from 1 to 10.
The output vector $Y$ represents the actual outputs of the classes in the MSTAR training dataset; it is a vector of 10 numerical values, with one value equal to 1, corresponding to the true class, while the others are equal to 0.
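A sketch of how such a training pair can be assembled, assuming each trained classifier exposes a function that returns its 10-value Softmax vector; the function names are placeholders, not the paper's code.

```python
# Build one fusion training pair: 3 x 10 Softmax outputs -> 30-dim input X,
# paired with the one-hot true label Y.
import numpy as np

def fusion_example(image, label, classifiers, num_classes=10):
    # classifiers: list of callables, each returning a softmax vector (10,).
    x = np.concatenate([clf(image) for clf in classifiers])  # shape (30,)
    y = np.zeros(num_classes)
    y[label] = 1.0  # one-hot target, as described above
    return x, y
```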
One of the learning algorithms for the MLP network is the BP (Backpropagation) algorithm, a supervised learning algorithm. Its principle is based on modifying the synaptic weights by propagating the error from the output layer back to the input layer through the intermediate layers.
Therefore, if we consider an MLP network with three layers, the BP algorithm can be listed as follows (a numpy sketch of the full training loop is given after the algorithm description).

Step 1 - Initialization
We start by initializing the weights from the input layer to the hidden layer in matrix $W_1$ and the weights from the hidden layer to the output layer in matrix $W_2$. We also initialize the biases in the vectors $b_1$ and $b_2$.
Step 2 - Forward Propagation
We calculate the output of the neurons in the hidden layer using the sigmoid activation function:
$$A_1 = \sigma(W_1 X + b_1).$$
We calculate the output of the neurons in the output layer using a linear activation function:
$$A_2 = W_2 A_1 + b_2.$$

Step 3 - Prediction
The output $A_2$ contains the predictions for the 10 classes.

Step 4 - Error Calculation
We calculate the error by comparing the predictions $A_2$ with the true values $Y$, using the MSE (Mean Squared Error):
$$E = \frac{1}{10}\sum_{i=1}^{10}\left(A_{2,i} - Y_i\right)^2.$$

Step 5 - Backpropagation
We use backpropagation to calculate the gradients of the error with respect to the weights $W_1$ and $W_2$, as well as the biases $b_1$ and $b_2$.
Step 6 - Weight Update
We use the calculated gradients to update the network's weights:
$$W_1 \leftarrow W_1 - \eta \frac{\partial E}{\partial W_1}, \qquad W_2 \leftarrow W_2 - \eta \frac{\partial E}{\partial W_2},$$
and similarly for the biases $b_1$ and $b_2$, where $\eta$ is the learning rate. Each time, we present a new input vector to the network with its associated output and repeat the calculation process from Step 2. Once we have presented all the examples in the training dataset, we calculate the sum of all the errors:
$$E_{\text{total}} = \sum_{m=1}^{M} E_m,$$
where $M$ is the total number of examples in the training dataset.
If we reach the desired error, the learning process ends. Otherwise, we repeat the algorithm from Step 2 with all the examples in the training dataset.
To make the learning algorithm faster, a term called Momentum, which takes into account the weight changes between two successive iterations, is added to the weight-update equations (Equations 14-17), in the standard form
$$\Delta W(t) = -\eta \frac{\partial E}{\partial W} + \alpha\, \Delta W(t-1),$$
where $\alpha$ is the Momentum coefficient.
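The following numpy sketch gathers Steps 1 to 6 with the Momentum term for the 30-input, 10-output fusion MLP. The layer sizes, learning rate and Momentum coefficient are illustrative assumptions; the MLP actually retained in the results section is trained with the Levenberg-Marquardt algorithm rather than plain BP.

```python
# One hidden layer (sigmoid), linear output, MSE loss, BP with momentum.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n_in, n_hidden, n_out = 30, 20, 10          # 3 x 10 softmax inputs, 10 classes
W1 = rng.normal(0, 0.1, (n_hidden, n_in)); b1 = np.zeros(n_hidden)  # Step 1
W2 = rng.normal(0, 0.1, (n_out, n_hidden)); b2 = np.zeros(n_out)
dW1_prev = np.zeros_like(W1); dW2_prev = np.zeros_like(W2)
eta, alpha = 0.01, 0.9                       # learning rate and Momentum

def train_step(x, y):
    global W1, b1, W2, b2, dW1_prev, dW2_prev
    a1 = sigmoid(W1 @ x + b1)                # Step 2: hidden layer (sigmoid)
    a2 = W2 @ a1 + b2                        # Step 2: output layer (linear)
    err = a2 - y                             # Steps 3-4: prediction error
    # Step 5: backpropagate the error through the two layers.
    grad_W2 = np.outer(err, a1); grad_b2 = err
    delta1 = (W2.T @ err) * a1 * (1.0 - a1)
    grad_W1 = np.outer(delta1, x); grad_b1 = delta1
    # Step 6: weight update with the Momentum term.
    dW2 = -eta * grad_W2 + alpha * dW2_prev
    dW1 = -eta * grad_W1 + alpha * dW1_prev
    W2 += dW2; W1 += dW1
    b2 -= eta * grad_b2; b1 -= eta * grad_b1
    dW2_prev, dW1_prev = dW2, dW1
    return np.mean(err ** 2)                 # per-example MSE
```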
Thus, our goal is to improve the overall performance of SAR image classification by judiciously combining the results of the different classifiers, using the power of an MLP network. Subsequently, this MLP network is trained on the MSTAR training dataset. The architecture chosen is the one that gives the best performance in terms of correct-classification rates, obtained through meticulous tuning of the hyper-parameters to achieve an optimal configuration.

RESULTS AND DISCUSSION
In this work, we classified the SAR images from the MSTAR database according to 10 class categories. In other words, we investigated different DL architectures for the classification of SAR satellite images. We started with a simple architecture (Basic CNN), for which we searched for the optimal configuration. Next, based on the concept of transfer learning, we examined pre-trained architectures to use in this work. Subsequently, we integrated the outcomes from the three CNNs through a dual approach. The initial method involves the use of the majority fusion rule, while the second employs the capabilities of NNs to learn and implement the fusion rule via supervised learning. To begin, we introduce the software and hardware tools employed in our study. Subsequently, we detail the dataset utilized in this paper. Lastly, we assess the performance of the individual architectures, along with their combined counterpart, using the confusion matrix as a key reference. Specifically, we focus on the correct-classification rate.

Software and Hardware Tools
To conduct our experiments and effectively categorize all SAR images, we employed the Matlab software, due to its user-friendly programming environment and extensive repository of image-processing and ML resources, encompassing artificial NNs and DL capabilities. For the DL framework, we utilized the DL toolbox. It is worth noting that the version of CUDA used in our experiments is 11.2. This combination allowed us to efficiently implement and train our CNN models, including the Basic CNN, AlexNet and Xception architectures.
Regarding the MLP network, we employed the NN toolbox to design and implement it. This approach provided the necessary tools for our network's development and evaluation, contributing to the overall success of our methodology.
Regarding our hardware setup, the configuration employed for training and assessing the CNN models consisted of an Intel® Core i7 microprocessor running at 3.5 GHz, coupled with 64 GB of RAM. Additionally, an NVIDIA RTX 2080 Ti GPU was integrated into the system.

Database of SAR Images
To implement the proposed classification methodology utilizing various DL techniques, the MSTAR database of SAR images was employed. Widely utilized in the literature due to its public availability, the MSTAR dataset encompasses a collection of SAR images with a resolution of 0.3 m × 0.3 m, captured using an X-band spotlight SAR sensor [29].
Figure 3 illustrates the MSTAR database, housing SAR images of military vehicles organized into 10 distinct class categories. Within the dataset, a total of 5165 images exist, with 2740 images allocated for training and 2425 images designated for the test dataset. The distribution of images across each target (class) can be found in Table 1. These images depict ten distinct classes of ground targets, encompassing entities like tanks (T62 and T72), rocket launchers (2S1), trucks (ZIL131), armored personnel carriers (BTR70, BTR60, BRDM2 and BMP2), air-defense units (ZSU23/4) and bulldozers (D7). Notably, the images were captured under diverse conditions, spanning different aspect angles, depression angles and serial numbers. To evaluate the effectiveness of our approach, the MSTAR base captured under SOC (Standard Operating Conditions) was considered [29].
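For completeness, a hedged sketch of how such a train/test split can be loaded with Keras, assuming the MSTAR chips have already been converted to standard image files and placed in one sub-folder per class; the directory names are hypothetical.

```python
# Load MSTAR train/test sets from class-per-folder image directories
# (assumed layout: mstar/train/<class_name>/*.png, mstar/test/<class_name>/*.png).
import tensorflow as tf

train_ds = tf.keras.utils.image_dataset_from_directory(
    "mstar/train", image_size=(158, 158), color_mode="grayscale",
    label_mode="categorical", batch_size=32)
test_ds = tf.keras.utils.image_dataset_from_directory(
    "mstar/test", image_size=(158, 158), color_mode="grayscale",
    label_mode="categorical", batch_size=32)
```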

Simulation Results for Basic CNN
Based on the basic-CNN architecture, the proposed approach determines the number of convolution layers, the number of filters, the size of the convolution filters, as well as the subsampling step. As this is an optimization problem whose inherent CNN hyper-parameters are difficult to determine, we address it by giving initial values for all CNN hyper-parameters. The training stage then changes a single hyper-parameter at a time until the value yielding the best classification is found. The process is repeated for the other hyper-parameters until all optimal hyper-parameters are obtained.
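A sketch of this one-hyper-parameter-at-a-time search is given below; `train_and_evaluate` is a placeholder for a full training and evaluation cycle of the basic CNN, and the candidate ranges shown are illustrative, not those of Table 2.

```python
# Coordinate-style search: vary one hyper-parameter at a time, keep the best.
def tune(initial, search_space, train_and_evaluate):
    best = dict(initial)
    for name, candidates in search_space.items():
        # Train once per candidate value, holding all other settings fixed.
        scores = {v: train_and_evaluate({**best, name: v}) for v in candidates}
        best[name] = max(scores, key=scores.get)  # keep the best value found
    return best

search_space = {                 # illustrative ranges, not the paper's tables
    "num_conv_layers": [2, 3, 4],
    "num_filters": [16, 32, 64],
    "filter_size": [3, 5, 7],
    "pool_stride": [2, 3],
}
```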
Here, we used 158×158 input images and proposed a basic CNN architecture with initial hyper-parameters as given in Table 2. Note that these hyper-parameters were obtained separately from each other, which allowed us to locate the range of hyper-parameters to be investigated. After successive tests, we found the optimal hyper-parameters, shown in Table 3. These hyper-parameters, which allowed us to achieve the best performances, are discussed in the next sub-section. Table 4 shows the performance evaluation of the basic CNN in terms of correct-classification rate on the MSTAR test set. The results were obtained through the training of the basic CNN with the optimal hyper-parameters, achieving an overall correct-classification rate of 97.3%. It should also be noted that for the ZSU23/4 military vehicle, the correct-classification rate reached 100%.

Table 4. Confusion matrix of the proposed basic CNN model.
In order to determine the architecture that yields a good classification of the SAR images, the next sub-section is devoted to the investigation of the two other pre-trained deep-learning architectures.

Simulation Results for the AlexNet and Xception Networks
AlexNet and Xception are transfer-learning networks that are based on pre-trained models. The models that we used were presented previously.
To classify new SAR images, we recycle a pre-trained network by replacing the three final layers with new layers adapted to the new dataset. Here, we need to change the number of classes to 10 and retain the parameters of the other layers.
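As an illustration, a minimal Keras sketch of this head replacement for Xception; Keras ships an ImageNet-pre-trained Xception, whereas AlexNet weights would have to be obtained separately, and single-channel SAR chips would need to be replicated to three channels to reuse the ImageNet weights.

```python
# Recycle a pre-trained backbone: drop the original classification head and
# attach a new 10-class Softmax output while keeping the backbone weights.
import tensorflow as tf

base = tf.keras.applications.Xception(
    weights="imagenet", include_top=False, input_shape=(158, 158, 3))

x = tf.keras.layers.GlobalAveragePooling2D()(base.output)
outputs = tf.keras.layers.Dense(10, activation="softmax")(x)  # 10 MSTAR classes
model = tf.keras.Model(base.input, outputs)
```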
Finally, a training operation is launched, so that the new network adapts to the new learning data. For a better classification, this is accomplished through the adjustment of the network parameters. It should be noted that the optimal hyper-parameters of the two pre-trained networks were found by successive tests. Tables 5 and 6 show the optimal hyper-parameters that were obtained for the two networks. Simulation results on the testing MSTAR dataset are shown in Tables 7 and 8.

Table 7. Confusion matrix of AlexNet.

Table 8. Confusion matrix of Xception.
Finally, based on Tables 4, 7 and 8, we achieved the best performance with the Xception classifier, with a correct-classification rate of 98.39%. This was followed by the Basic classifier, which achieved a rate of 96.61%, and lastly the AlexNet classifier, with a classification rate of approximately 95.12%.
After having trained several DL architectures for the classification of SAR images, it is important to consider the possibility of exploring data-fusion theory for the classification of SAR images. To achieve this, we use a parallel architecture, in which we apply the fusion rule via the majority vote, as well as NNs for the construction of a new data-fusion rule.

Simulation Results with Decision Fusion Techniques
In this sub-section, we use a parallel architecture consisting of three classifiers, namely the basic CNN and the two pre-trained networks, AlexNet and Xception, whose outputs are passed to the fusion center. In the following, we evaluate the performance of the two data-fusion approaches, namely the majority rule and the rule based on NNs. The two techniques are evaluated on the MSTAR testing dataset.

Majority Voting Rule
In this technique, the basic decisions from each classifier are transmitted to the fusion center for a final determination using the majority-voting rule for data fusion. It is worth noting that in instances where the three classifiers yield three differing classes, the rule yields a random decision. The performance of this architecture was appraised using the MSTAR test database. The resulting confusion matrix is detailed in Table 9, showing an average global accuracy rate of 99.53%, indicating a gain of 1.14% over the best individual classifier.

Neural Network
The second technique which we used is data fusion using ANNs (Artificial NNs), exploiting the ability of NNs to model any rule through supervised learning. We investigated several NN architectures and retained the MLP architecture. In our investigation, we used the Levenberg-Marquardt learning algorithm to train the network. The Levenberg-Marquardt algorithm is a widely used optimization algorithm for training NNs; it adjusts the network's weights to minimize the error between the predictions and the actual values. The configuration of this network was iteratively fine-tuned through multiple experiments, involving the minimization of a cost function and adjustments of hyper-parameters, such as the number of hidden layers, the number of neurons in the hidden layers, the activation functions, the error metrics and the Momentum parameter, using the MSTAR training dataset. This iterative process continued until achieving optimality, as demonstrated in Figures 4 and 5, which illustrate the optimal structure of the MLP network and the loss function.
Upon completing the training phase, simulation results on the MSTAR test dataset yielded an average overall accuracy rate of 99.44%, as depicted in the confusion matrix of Table 10, indicating a significant improvement.

Table 10. Confusion matrix of the NN-based data-fusion center.

Execution-time Analysis for the Proposed Methods
We have taken into consideration the execution time of the two proposed methods, recognizing the crucial importance of speed in radar detection and SAR image-classification applications, which often operate in real time. Firstly, it is important to note that the three DL architectures used require significant time during the training phase. However, once this phase is completed offline, we measured the execution time of the three architectures, namely the basic CNN, AlexNet and Xception. The execution times that we obtained for classifying a single image are 0.908 ms, 2.76 ms and 11.48 ms, respectively.
It is essential to note that these execution times are calculated for classifying a single image.In a parallel implementation, we consider the longest time among these three times, which is that of the Xception architecture, as the next step can only occur when the three outputs from the three classifiers are ready.
Regarding the execution time for data fusion, we observed that applying the majority rule requires approximately 15.8 µs, while the MLP network requires about 20.86 µs. It is worth noting that these times are practically comparable, with a slight advantage for the majority rule.
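For reference, per-image latency figures like those above can be measured with a simple harness of this kind; a sketch, where `predict` stands for any of the classifiers or fusion rules.

```python
# Average per-image latency over repeated single-image predictions.
import time

def time_per_image(predict, image, repeats=1000):
    predict(image)                      # warm-up (GPU transfer, caches)
    start = time.perf_counter()
    for _ in range(repeats):
        predict(image)
    return (time.perf_counter() - start) / repeats  # seconds per image
```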
In summary, while the three DL architectures used require a significant amount of time during the training phase, once this phase is completed, our fusion methods present execution times that are compatible with the requirements of real-time radar detection and SAR image-classification applications, thus demonstrating their practicality and efficiency.

Performance Comparison of Different Methods
In order to facilitate the comparison of our findings with other DL methodologies assessed on the MSTAR test dataset, we present a summary of methods and their corresponding average classification rates in Table 11. Notably, our approach, employing a parallel architecture with a fusion center, achieves the highest average accuracy in correct classification. This outcome underscores how our data-fusion strategy effectively leverages the unique strengths of the individual classifiers while mitigating classification errors. Note that our majority voting-based fusion rule exhibits the superior performance. Additionally, it is worth highlighting that the data-fusion technique centered on ANNs also delivers highly commendable results.

CONCLUSION
The primary goal of this study was to employ a combined architecture consisting of three classifiers and a data-fusion center for the purpose of classifying SAR satellite images. To achieve this, a basic CNN architecture was utilized. This architecture underwent optimization by fine-tuning the network parameters, including hyper-parameters, until reaching the optimal configuration. In addition, two other pre-trained CNN networks were employed. For these networks, the final three layers were adapted to facilitate SAR image classification. A similar optimization process was applied to determine the best architecture. To merge the outputs of the distinct classifiers, a fusion center was implemented. Two techniques were employed for this fusion, namely the majority-voting rule and the utilization of NNs to construct a learning decision rule. Through simulations conducted on SAR images, with a focus on the MSTAR database, the effectiveness of data fusion in SAR image classification was demonstrated. Notably, the majority fusion rule exhibited an exceptional performance, achieving a correct-classification rate of 99.53%. As future work, in order to validate the performance of the proposed approach, extensive experiments on other datasets should be conducted with other fusion techniques.

Figure 1. Architecture of a basic CNN model.

Figure 3. Military vehicles from the MSTAR database and their corresponding optical images.

Table 1. Training and testing MSTAR dataset.

Table 5. Optimal hyper-parameters for AlexNet.

Table 9. Confusion matrix of the majority-voting rule.