Two-Stream Deep Fusion Network Based on VAE and CNN for Synthetic Aperture Radar Target Recognition

Usually, radar target recognition methods use only a single type of high-resolution radar signal, e.g., the high-resolution range profile (HRRP) or the synthetic aperture radar (SAR) image. In fact, in the SAR imaging procedure, we can simultaneously obtain both the HRRP data and the corresponding SAR image. Although the information contained in the HRRP data and the SAR image is not exactly the same, both are important for radar target recognition. Therefore, in this paper, we propose a novel end-to-end two-stream fusion network to make full use of the different characteristics obtained from modeling HRRP data and SAR images, respectively, for SAR target recognition. The proposed fusion network contains two separate streams in the feature extraction stage, one of which takes advantage of a variational auto-encoder (VAE) network to acquire the latent probabilistic distribution characteristic from the HRRP data, while the other uses a lightweight convolutional neural network, LightNet, to extract the 2D visual structure characteristics from SAR images. Following the feature extraction stage, a fusion module is utilized to integrate the latent probabilistic distribution characteristic and the structure characteristic so as to reflect the target information more comprehensively and sufficiently. The main contribution of the proposed method consists of two parts: (1) different characteristics from the HRRP data and the SAR image can be used effectively for SAR target recognition, and (2) an attention weight vector is used in the fusion module to adaptively integrate the different characteristics from the two sub-networks. On the HRRP data and SAR images of the MSTAR and civilian vehicle datasets, the proposed method improves the recognition rate by at least 0.96% and 2.16%, respectively, compared with current SAR target recognition methods.


Introduction
Synthetic aperture radar (SAR) target recognition is a development of radar automatic target recognition (RATR) technology. Because of the all-weather, all-day and long-distance perception capabilities of SAR, SAR target recognition plays an important role in both military and civil fields [1][2][3][4]. Given the overwhelming amount of SAR data now available, SAR target recognition is urgently required, and it has attracted wide attention worldwide.
As a type of data widely used in RATR [5][6][7][8][9][10], high-resolution range profile (HRRP) data can be obtained simultaneously with the corresponding SAR image in the procedure of SAR imaging. HRRP data obtained from SAR echoes have been widely used for target recognition [11,12]. Figure 1 shows the relationship between HRRP data and the SAR image based on the classical range-Doppler algorithm (RDA) [13]. HRRP data are a 1D distribution of the radar cross section and can be obtained by the modulo operation after range compression of the received SAR echoes. HRRP target recognition receives widespread attention in the RATR community due to the relatively low complexity of its signal acquisition [1][2][3][4]. A SAR image is a 2D image of the target derived by coherently processing high-range resolution radar echoes and conducting translational motion compensation by means of range cell migration correction (RCMC). SAR images are easier and more intuitive to understand, being interpretable for human visual perception, as each pixel value reflects the surface microwave reflection intensity. Feature extraction is an important part of target recognition, and the quality of the extracted features directly affects recognition performance. RATR based on both HRRP data and SAR images has evolved from manually extracted features to deep features [1][2][3][4][14,15], which has also led to better recognition performance. However, most existing RATR methods based on HRRP data or SAR images use only a single type of data. According to Figure 1, due to the different generation mechanisms, the information contained in the HRRP data and the SAR images is not exactly the same.
Since the HRRP data and the SAR images can each represent the original SAR echoes from only one aspect, using these two data sources together can yield a more complete information representation of the original SAR echoes. Modeling a complete interpretation using only unimodal data is theoretically insufficient. Therefore, to reveal more complete information, we set out to formulate a novel framework to fuse the characteristics obtained from modeling HRRP data and the SAR image for radar target recognition. To the best of our knowledge, this is the first time that HRRP data and SAR images have been comprehensively utilized for radar target recognition.
In this paper, we propose an end-to-end two-stream fusion network. The first stream takes the HRRP data as the input and draws support from the VAE, a deep probabilistic model, to effectively extract the latent probabilistic distribution features. The other stream takes the SAR image as its input. In this stream, a lightweight CNN, LightNet, is utilized to extract 2D visual structure features. A fusion module with an attention mechanism is exploited to integrate the different characteristics extracted from the two different signal types into a global space, to obtain a single, compact, comprehensive representation for radar target recognition that reflects the target information more comprehensively and sufficiently. In the fusion module, an automatically learned attention weight vector is used to adaptively integrate the different characteristics, controlling the contribution of each feature to the overall output feature on a per-dimension basis, which remarkably improves recognition performance. Finally, the fused feature is fed into a softmax layer to predict the classification results. More specifically, the main contributions of the proposed two-stream deep fusion network for target recognition are as follows: 1.
Considering that the SAR image and the corresponding HRRP data, whose information content is not exactly the same, can be obtained simultaneously in the procedure of SAR imaging, we apply two different sub-networks, VAE and LightNet, in the proposed deep fusion network to mine the different characteristics from the average profile of the HRRP data and from the SAR image, respectively. Through the joint utilization of these two types of characteristics, the target representation becomes more comprehensive and sufficient, which is beneficial for the target recognition task. Moreover, the proposed network is a unified framework which can be jointly optimized end-to-end.

2.
To integrate the latent feature of the VAE and the structure feature of LightNet, a novel fusion module is developed in the proposed fusion network. The proposed fusion module takes advantage of the latent feature and the structure feature to automatically learn an attention weight. The learned attention weight is then used to adaptively integrate the latent feature and the structure feature. Compared with the original concatenation operator, the proposed fusion module achieves better recognition performance.
The rest of this paper is arranged as follows. Section 2 gives the related works of RATR based on HRRP data and SAR images. Section 3 introduces the novel two-stream fusion network. In Section 4, experiments based on the measured radar dataset and their corresponding analysis are presented to verify the target recognition performance of the proposed two-stream fusion method. Finally, the conclusions are presented in Section 5.

Radar Target Recognition
Traditional radar target recognition methods are mostly based on manual feature extraction. Such hand-crafted features are inappropriate when there is insufficient prior knowledge about the application. Meanwhile, these features are mainly lower-level representations, e.g., textural features and local physical structural features, which cannot represent higher-level, abstract information.
Recently, deep learning has made progress by leaps and bounds in computer vision tasks due to its powerful representation capacity.
In HRRP target recognition, owing to the successful application of deep neural networks in various tasks, several deep neural networks have been developed for HRRP data. Some works focus on selecting suitable networks for HRRP recognition, such as the stacked auto-encoder (SAE) [14], the denoising auto-encoder (DAE) [5] and the recurrent neural network (RNN) [16,17]. Other works focus on how to use HRRP data reasonably, such as using the average profile of HRRP data or sequential HRRP data [18]. Nevertheless, the above-mentioned neural networks for HRRP recognition only obtain point estimations of the latent features, which lack descriptions of the underlying probabilistic distribution. Considering that HRRP data do exhibit statistical distribution characteristics, as described in [19][20][21][22][23][24], probabilistic statistical models are exploited to describe the underlying probabilistic distribution; such models can incorporate prior information on a solid theoretical basis, and an appropriate prior will enhance model performance. Meanwhile, probabilistic statistical models possess robustness and flexibility in modeling [25]. At present, several probabilistic statistical models have been developed to describe HRRP [23][26][27][28]. Nevertheless, traditional probabilistic models need to preset the distribution pattern of the data, such as the Gaussian or Gamma distribution, which is relatively simple and limits their ability to fit the original data distribution [29]. In addition, since traditional probabilistic models are based on shallow architectures with simple linear mapping structures, they are only good at learning linear features. Different from traditional probabilistic models, however, the VAE [6,30,31] introduces the neural network into probabilistic modeling; neural networks stack nonlinear layers to form a deep structure.
This nonlinear capability in VAE makes the data fitting more accurate, which can reduce the performance degradation caused by inaccurate data fitting. The deep structure of VAE can mine deep latent features of data with stronger feature separability. Because there is an explicit latent feature to represent the distribution characteristics of data in VAE, the latent variable is often directly used as the representational information of the sample for classification tasks, including HRRP target recognition [7,32,33], and has achieved good performance.
At present, VAE is the prevailing generative model. Meanwhile, the generative adversarial network (GAN) is also well known as a popular generative model. Although the VAE and the GAN both belong to generative models and they are usually mentioned at the same time, they are different in many aspects. In VAE, there is an explicit latent feature to represent the distribution characteristics of the data. Therefore, in the practical application of VAE, in addition to the common sample generation, the latent variable of VAE is often directly used as representational information of the sample for the classification and recognition tasks. However, restricted by the inherent mechanism of GAN, there is no explicit feature which can represent the distribution characteristics of data. The application of GAN focuses on the related fields of sample generation and transfer learning.
In the target recognition of SAR images, the auto-encoder (AE) [1,3] and the RBM [2], two widely used unsupervised deep neural network structures, have also been employed and achieve good performance. Among deep neural networks, the CNN has become the dominant deep learning approach, as in the VGG network [34] and ResNet. CNN architectures are usually comprised of multiple convolutional layers (each followed by an activation layer), pooling layers, and one or more fully connected layers. In CNNs, the local connections and weight sharing of the convolution operation, together with the pooling operation, effectively reduce the number of parameters and the complexity, resulting in invariance to translation and distortion, which makes the learned features more robust [4]. Another advantage of CNNs is that they can utilize convolution kernels to extract 2D visual structure information, from the apparent to the abstract, through layer-by-layer learning. This visual structure information plays a vital role in image recognition [35][36][37].
In this paper, VAE and CNN are used as sub-networks for the HRRP data and the SAR image, respectively.

Information Fusion
In recent years, with the development of sensor technology, the diversity of information forms, the huge quantity of information, the complexity of information relations, and the demands for timeliness, accuracy and reliability in information processing have become unprecedented. Therefore, information fusion technology has developed rapidly. Information fusion denotes the process of combining data from different sensors or information sources to obtain new or more precise knowledge of physical quantities, events or situations [38].
According to the abstraction level of the information, information fusion methods can be divided into three categories: data-level fusion [39], feature-level fusion [40] and decision-level fusion [41]. Data-level and decision-level fusion are the two most easily implemented information fusion methods, but their performance improvements are also limited. Recently, it has also become an important research topic to comprehensively and effectively use the various kinds of information in radar data, such as multi-temporal [42] and multi-view [43] data, to achieve better model performance. An inverse synthetic aperture radar (ISAR) target recognition method based on both range profile (RP) data and ISAR images was proposed in [44], based on decision-level fusion of the classification results of the RP data and the ISAR images. Feature-level fusion is the most effective form of information fusion, and it is often used as an effective means to improve performance in deep learning research. Several works on image segmentation also use feature-level fusion to fuse multi-level features [45][46][47]. However, those works fuse features of the same data at different scales, whereas this paper fuses features extracted from different data sources through their respective feature extraction networks.

Two-Stream Deep Fusion Network Based on VAE and CNN
The framework of the proposed two-stream deep fusion network for target recognition is depicted in Figure 2. As shown in Figure 2, the framework is briefly introduced as follows.

1.
Data acquisition: as can be seen from Figure 1, the complex-valued high-range resolution radar echoes can be obtained after range compression of the received SAR echoes. Then, the HRRP data are obtained through the modulo operation. At the same time, based on the complex-valued high-range resolution radar echoes, the complex-valued SAR image is obtained through azimuth focusing processing. Then, the commonly used real-valued SAR image for target recognition can be obtained by taking the modulus of the complex-valued SAR image.

2.
VAE branch: based on the HRRP data, the average profile of the HRRP is obtained by preprocessing. Then, the average profile is fed into the VAE branch to acquire the latent probabilistic distribution as a representation of the target information.

3.
LightNet branch: the other branch takes the SAR image as input and draws support from a lightweight convolutional architecture, LightNet, to extract the 2D visual structure information as another essential representation of the target information.

4.
Fusion module: the fusion module is employed to integrate the distribution representation and the visual structure representation to reflect more comprehensive and sufficient information for target recognition. The fusion module merges the VAE branch and the LightNet branch into a unified framework which can be trained in an end-to-end manner.

5.
Softmax classifier: finally, the integrated feature is fed into a usual softmax classifier to predict the category of the target.

In Sections 3.1 to 3.6, some important components, including the acquisition of the HRRP data and the real-valued SAR image from the high-range resolution echoes, the VAE branch, the LightNet branch, the fusion module, the loss function and the training procedure, are introduced concretely. Figure 1 in the Introduction gives the data acquisition procedure of the HRRP data and the real-valued SAR image from the received SAR echoes based on the RDA. The received SAR echoes are obtained from the radar-received signals through dechirping and matched filtering. The RDA SAR imaging algorithm can be divided into two steps: range focusing processing and azimuth focusing processing. The range focusing processing includes, in turn, the range fast Fourier transform (FFT), range compression and the range IFFT. Then, the high-range resolution radar echoes can be obtained. The azimuth focusing processing includes, in turn, the azimuth FFT, RCMC, azimuth compression and the azimuth IFFT.

Acquisition of the HRRP Data and the Real-Valued SAR Image from High-Range Resolution Echoes
Based on the high-range resolution radar echoes, the HRRP data are obtained through the modulo operation. At the same time, based on the complex-valued high-range resolution radar echoes, the complex-valued SAR image is obtained through azimuth focusing processing. The azimuth focusing processing includes, in turn, the azimuth fast Fourier transform (FFT), range cell migration correction (RCMC), azimuth compression and the azimuth IFFT. Then, the commonly used real-valued SAR image for target recognition can be obtained by taking the modulus of the complex-valued SAR image. From this description of the SAR imaging procedure, we can see that the complex-valued SAR image is obtained from the high-range resolution radar echoes. Conversely, given the complex-valued SAR image, the corresponding high-range resolution radar echoes and HRRP data can also be recovered [48,49].
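As a toy illustration of the one-way information loss described above, the following sketch (with hypothetical array shapes, and a bare FFT standing in for the full azimuth focusing chain; this is not the actual RDA implementation) forms HRRP data as the modulus of complex range-compressed echoes and a real-valued image as the modulus after azimuth processing:

```python
import numpy as np

rng = np.random.default_rng(0)
# Complex high-range resolution echoes: (azimuth pulses, range cells)
echoes = rng.normal(size=(64, 128)) + 1j * rng.normal(size=(64, 128))

# HRRP data: modulus of the range-compressed echoes (1D range profiles)
hrrp = np.abs(echoes)

# Crude stand-in for azimuth focusing: FFT along the azimuth axis;
# the modulus then yields the real-valued image used for recognition
complex_image = np.fft.fft(echoes, axis=0)
real_image = np.abs(complex_image)
```

Because each modulus discards phase, neither real-valued product can be converted back into the other, even though both derive from the same complex echoes.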
Considering the mechanism inherent in the modulo operation, the modulo operation for generating HRRP data and the modulo operation for generating real-valued SAR images have different information loss characteristics. Therefore, although the HRRP data and the real-valued SAR images used in the proposed method keep a one-to-one correspondence, they can no longer be converted into each other due to the modulo operation. In other words, the information contained in the HRRP data and the real-valued SAR images used in the proposed method is not exactly the same. The HRRP data and the SAR images can each represent the original high-range resolution radar echoes from only one aspect. Therefore, the features extracted from the HRRP data cannot be derived from the SAR images with certainty.

The VAE Branch
Before radar HRRP statistical modeling, some issues should be considered in practical application. The first is the time-shift sensitivity of HRRP; centroid alignment [50] is commonly used as the time-shift compensation technique. Amplitude-scale sensitivity can be eliminated through amplitude-scale normalization, such as L2 normalization. Concerning target-aspect sensitivity [15,32], it has been demonstrated that the average profile has a smoother and more concise signal form than a single HRRP, and can better reflect the scattering property of the target in a given aspect-frame. From the perspective of signal processing, the average profile represents the target's stable physical structure information in a frame [8,9,51]. One important characteristic of the average profile is that it can depress the speckle effect of HRRPs. Furthermore, the average profile also suppresses the impact of noise spikes and the amplitude fluctuation property.
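The preprocessing chain above (centroid alignment by circular shift, L2 normalization, then averaging over a frame) can be sketched as follows; the function names are our own, and the random frame stands in for measured HRRPs:

```python
import numpy as np

def centroid_align(x):
    """Circularly shift one HRRP so its amplitude centroid sits at the center."""
    idx = np.arange(len(x))
    centroid = int(round(np.sum(idx * x) / np.sum(x)))
    return np.roll(x, len(x) // 2 - centroid)

def average_profile(frame):
    """Align and L2-normalize each HRRP in a frame, then average them."""
    aligned = np.stack([centroid_align(x) for x in frame])
    normed = aligned / np.linalg.norm(aligned, axis=1, keepdims=True)
    return normed.mean(axis=0)

# A frame of 8 HRRPs with r = 256 range cells (synthetic, for illustration)
frame = np.abs(np.random.default_rng(1).normal(size=(8, 256)))
x_ap = average_profile(frame)  # smoother than any single profile
```

Averaging after alignment is what suppresses speckle and noise spikes: misaligned profiles would smear the stable scattering structure instead of reinforcing it.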
According to the literature [8,10,51], the average profile is defined as

x_AP = (1/M) ∑_{i=1}^{M} x_i, (1)

where x_i ∈ R^{r×1} denotes the ith HRRP sample in a frame containing M samples, and r is the dimension of the HRRP samples. The VAE holds that the sample space can be generated by the latent variable space, that is, sampling latent variables from a simpler latent variable space can generate the real samples within the sample space. The latent variable in the VAE can describe the distribution characteristics of the data. The framework of the VAE is illustrated in Figure 3. Given the observations {x_AP,n}_{n=1}^{N} with N samples, the VAE exploits an encoder model that takes x_AP as input and outputs the mean µ and the standard deviation σ of the latent variable z. Assuming the encoder model can be represented as f_VAE_E with parameter ϕ, which is also known as the inference model q_ϕ(z|x_AP), the encoder of the VAE can be formulated as follows:

[µ, σ] = f_VAE_E(x_AP; ϕ). (2)

Figure 3. Architecture of the VAE with Gaussian distribution assumption.
Here, the reparameterization trick is adopted to sample from the posterior z ∼ q_ϕ(z|x_AP) using the following:

z = µ + σ ⊙ ε, (3)

where ε ∼ N(0, I) and ⊙ represents the element-wise product. Then, with the latent variable z as the input, the decoder model f_VAE_D with parameter θ outputs the reconstructed sample x̂_AP, which can be formulated as follows:

x̂_AP = f_VAE_D(z; θ). (4)

The decoder model is also known as a generative process with the probabilistic distribution p_θ(x_AP|z).
The goal of the VAE model is to use the variational distribution q_ϕ(z|x_AP) to approximate the true posterior distribution p_θ(z|x_AP). Formally, as shown in Equation (5), the KL divergence is used to measure the similarity between q_ϕ(z|x_AP) and p_θ(z|x_AP), as follows:

KL(q_ϕ(z|x_AP) ‖ p_θ(z|x_AP)) = log p_θ(x_AP) − L_ELBO(ϕ, θ; x_AP), (5)

where

L_ELBO(ϕ, θ; x_AP) = E_{q_ϕ(z|x_AP)}[log p_θ(x_AP|z)] − KL(q_ϕ(z|x_AP) ‖ p_θ(z)) (6)

is the variational evidence lower bound (ELBO) [52,53]. For the given observations, p_θ(x_AP) is a constant. Thus, minimizing KL(q_ϕ(z|x_AP) ‖ p_θ(z|x_AP)) is equivalent to maximizing the ELBO. Therefore, the loss of the VAE on the data x_AP can be written as follows:

L_VAE = −E_{q_ϕ(z|x_AP)}[log p_θ(x_AP|z)] + KL(q_ϕ(z|x_AP) ‖ p_θ(z)). (7)

In Equation (7), the first term can be regarded as the reconstruction loss, which can also be written as follows:

L_rec = ‖x_AP − x̂_AP‖_2^2. (8)

This teaches the decoder to reconstruct the data and incurs a cost if the output of the decoder cannot reconstruct the data accurately. Usually, we can use the l2-norm between the original data x_AP and the reconstructed data x̂_AP as the reconstruction loss. The second term is the KL divergence between the encoder's distribution q_ϕ(z|x_AP) and the prior p_θ(z). Typically, if we let the prior over the latent variables be the centered isotropic multivariate Gaussian p_θ(z) = N(z; 0, I), the KL divergence in Equation (7) can be computed in closed form as follows:

KL(q_ϕ(z|x_AP) ‖ p_θ(z)) = −(1/2) ∑_{j=1}^{J} (1 + log σ_j^2 − µ_j^2 − σ_j^2), (9)

where µ_j and σ_j represent the jth elements of µ and σ, respectively, and J denotes the dimensionality of the latent variable. Combining Equations (8) and (9), the VAE loss can be summarized as follows:

L_VAE = L_rec + KL(q_ϕ(z|x_AP) ‖ p_θ(z)). (10)
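The squared-l2 reconstruction term and the closed-form Gaussian KL term described above can be sketched numerically as follows (a NumPy stand-in for the network outputs, parameterizing the standard deviation through its log-variance):

```python
import numpy as np

def vae_loss(x, x_hat, mu, log_var):
    """Squared-l2 reconstruction loss plus closed-form KL to a standard Gaussian."""
    recon = np.sum((x - x_hat) ** 2)
    kl = -0.5 * np.sum(1.0 + log_var - mu ** 2 - np.exp(log_var))
    return recon + kl

# When the posterior equals the prior (mu = 0, sigma = 1), the KL term vanishes;
# perfect reconstruction zeroes the first term as well
mu = np.zeros(50)
log_var = np.zeros(50)
x = np.ones(256)
loss = vae_loss(x, x, mu, log_var)
```

Any deviation of the posterior mean or variance from the standard Gaussian prior makes the KL term strictly positive, which is what regularizes the latent space.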
In practice, the encoder model is implemented with a three-layer, fully connected neural network. The units in the encoder model are 512, 256, and 128 respectively. Moreover, the decoder model is also implemented with a three-layer, fully connected neural network. The units in the decoder model are 128, 256 and 512 respectively. The dimensions of the latent variable z, the mean µ and the standard deviation σ are set to 50.
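Under these settings (three fully connected encoder layers of 512, 256 and 128 units and a 50-dimensional latent variable), the encoder forward pass with the reparameterization trick can be sketched as follows; random weights stand in for trained parameters, and the 256-dimensional input is an assumed profile length:

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda a: np.maximum(a, 0.0)

def encode(x, dims=(512, 256, 128), latent=50):
    """Three fully connected ReLU layers, then linear heads for mu and log-variance."""
    h = x
    for d in dims:
        h = relu(rng.normal(scale=0.05, size=(d, h.shape[0])) @ h)
    mu = rng.normal(scale=0.05, size=(latent, h.shape[0])) @ h
    log_var = rng.normal(scale=0.05, size=(latent, h.shape[0])) @ h
    return mu, log_var

x_ap = rng.random(256)                 # an average profile of assumed dimension 256
mu, log_var = encode(x_ap)
eps = rng.standard_normal(mu.shape)    # noise for the reparameterization trick
z = mu + np.exp(0.5 * log_var) * eps   # differentiable latent sample
```

Sampling through mu and sigma rather than from the posterior directly is what keeps the sampling step differentiable with respect to the encoder parameters.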

The LightNet Branch
Among deep neural networks, CNNs have made remarkable progress due to their characteristics of local connection and weight sharing. CNNs take advantage of convolution kernels to extract 2D visual structure information through layer-by-layer processing. Many excellent convolutional network architectures, such as VGG and ResNet, have come to dominate many fields. Nevertheless, considering the limited data volume, these abovementioned networks still have a larger number of parameters for the task of SAR target recognition. Therefore, we applied a lightweight CNN, called LightNet, which has very few parameters and can achieve approximate performance.
The LightNet architecture is mainly comprised of convolution layers and pooling layers. Each convolution layer is followed by a rectified linear unit (ReLU) activation function and a batch normalization layer, which allows the network to use much higher learning rates and be less careful about initialization [54]. The architecture of LightNet is shown in Table 1. LightNet has only five convolutional layers. The kernel size in the first convolutional layer is 11 × 11, a relatively large size chosen to gain a larger receptive field. In the following three convolutional layers, the kernel sizes are 5 × 5. Considering that a fully connected layer, which is usually used to transform the feature maps into a feature vector at the final position of a network, has many parameters, we use a convolutional layer with a 3 × 3 kernel and no padding to generate the feature vector from the feature maps instead. This convolutional layer has fewer parameters than a fully connected layer. Compared with global pooling, the convolutional layer can not only learn more abstract features but can also adjust the dimension of the feature vector. The LightNet branch takes the SAR image x_I as input to extract the 2D visual structure information m as another essential representation of the target information. Assuming f_LNet represents the LightNet with parameter ψ_LNet, the LightNet branch can be formulated as follows:

m = f_LNet(x_I; ψ_LNet). (11)
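The spatial sizes produced by such a stack follow the usual convolution arithmetic. The trace below is a sketch under assumed strides, poolings and a 94 × 94 input (the actual layer configuration is given in Table 1); it shows how a final 3 × 3 convolution with no padding can collapse the maps to a 1 × 1 feature vector:

```python
def conv_out(size, kernel, stride=1, pad=0):
    """Spatial output size of a convolution or pooling layer."""
    return (size + 2 * pad - kernel) // stride + 1

s = 94                         # assumed square input size
s = conv_out(s, 11)            # 11x11 conv  -> 84
s = conv_out(s, 2, stride=2)   # 2x2 pool    -> 42
s = conv_out(s, 5)             # 5x5 conv    -> 38
s = conv_out(s, 2, stride=2)   # 2x2 pool    -> 19
s = conv_out(s, 5)             # 5x5 conv    -> 15
s = conv_out(s, 2, stride=2)   # 2x2 pool    -> 7
s = conv_out(s, 5)             # 5x5 conv    -> 3
s = conv_out(s, 3)             # final 3x3 conv, no padding -> 1x1 feature vector
```

With the spatial extent reduced to 1 × 1, the number of output channels of the last convolution directly sets the dimension of the feature vector m.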

Fusion Module
In the feature extraction stage, a VAE model is employed on the HRRP data to extract the latent probabilistic distribution information as a feature, and the lightweight LightNet is used on the SAR image to extract the structure feature. In the neural network framework, the most common feature fusion approaches are the concatenation operation and element-wise addition. The concatenation operation combines multiple original features along the feature dimension to generate a fused feature, and the dimension of the fused feature is equal to the sum of the original feature dimensions. Although it is simple to implement, the dimension of the fused feature is relatively high, which puts greater pressure on the subsequent classifiers, including an increase in the number of parameters and in the cost of optimizing them. Element-wise addition is also a common feature fusion method. With element-wise addition, the fused feature is obtained by adding the original features element by element, which keeps the dimension consistent with the original features and requires fewer parameters in the subsequent classifiers than the concatenation operation. In essence, however, element-wise addition assumes that the importance of the different features is the same.
To reflect the target information more comprehensively and sufficiently, a novel fusion module is exploited to integrate the latent feature obtained from the VAE and the structure feature obtained from LightNet, which also merges the two streams into a unified framework with end-to-end joint optimization. The proposed fusion module is a further extension of element-wise addition, inspired by the gated recurrent unit (GRU) [55]. On the one hand, we use an attention weight vector, not a single value, to integrate the different features. More clearly, in the feature fusion, we no longer assume that every dimension of a feature vector shares the same weight; instead, each dimension has its own weight coefficient. By considering the differences in the importance of each feature more carefully, the influence on the fused feature of features that contribute more to the target task is increased, and likewise, the influence of features that contribute less is weakened. On the other hand, compared with traditional, empirically set weight values, the attention weight vector is learned automatically according to the target task, which allows an adaptive adjustment of the feature weights with the samples and categories. Figure 4 shows the flowchart of the fusion module. First, the latent feature z and the structure feature m are fed into separate fully connected layers to generate the features Z ∈ R^{d×1} and M ∈ R^{d×1}:

Z = ReLU(W_Z z), (12)
M = ReLU(W_M m), (13)

where the features Z and M have the same dimension d, which was set to 50 in the experiments, ReLU(·) denotes the ReLU activation operation, and W_Z and W_M are the parameters of the respective fully connected layers.
Here, the fully connected layers are applied not only to further map the two features into a global space, but also to make their dimensions and orders correspond consistently for the subsequent element-wise addition, i.e., the ith element of Z and the ith element of M are in one-to-one correspondence. Then, the latent feature z and the structure feature m are concatenated into a long feature vector, and a fully connected layer is used on this long feature vector to learn the attention vector α ∈ R^{d×1}:

α = sigmoid(W_α [z; m]), (14)

where [z; m] denotes the concatenation of z and m, sigmoid(·) denotes the sigmoid activation operation, and W_α denotes the parameter. Thanks to the sigmoid activation, the values in the attention vector lie in the range [0, 1]. Here, the attention mechanism is derived from the selective attention behavior of the human brain when processing information. The human brain scans the total information quickly to find the focus area, and then invests more attention resources in this area to obtain more detailed information for the target task, while suppressing other useless information. This greatly improves the screening of high-value information from a large quantity of information. Similar to the selective attention mechanism of human beings, the core goal of the attention mechanism we used is to select the information that is most critical to the current task from a large quantity of information. Therefore, a fully connected layer with activation is used to simulate the neurons in the human brain. The input of the fully connected layer is all of the sample information, i.e., all the features of the sample. By using the fully connected layer to sense all the information, we can determine the focus features, and then invest more attention on them while suppressing other useless information. That is to say, from the output of the fully connected layer we can know where to focus and to what degree. Therefore, the output of the fully connected layer is called the attention weight vector.
Finally, the attention vector α is used as a weight to sum Z and M. Since the values of α are in the range [0, 1], the values in 1 − α are also in the range [0, 1]. The attention vector can be regarded as a weight vector which controls the contribution of the feature Z to the overall output of the unit. In turn, considering the weight normalization, the weight of feature M can be obtained directly by the operation 1 − α without an extra learning process. More concretely, the attention vector α is element-wise multiplied with the feature Z, the vector 1 − α is element-wise multiplied with the feature M, and the element-wise sum is then used to integrate the two features:

F = α ⊗ Z + (1 − α) ⊗ M, (15)

where ⊗ represents the element-wise multiplication, 1 is a vector whose elements are all one, and F represents the fused feature. Assuming f_fusion represents the overall fusion module with parameter ψ_fusion, the fusion module can be summarized as follows:

F = f_fusion(z, m; ψ_fusion). (16)
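The attention-weighted fusion described above can be sketched as follows; fixed random matrices stand in for the learned parameters W_Z, W_M and W_α, and the feature dimensions are illustrative:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def fuse(z, m, W_Z, W_M, W_alpha):
    """Attention-weighted fusion: F = alpha * Z + (1 - alpha) * M."""
    Z = np.maximum(W_Z @ z, 0.0)                       # project latent feature (ReLU)
    M = np.maximum(W_M @ m, 0.0)                       # project structure feature (ReLU)
    alpha = sigmoid(W_alpha @ np.concatenate([z, m]))  # per-dimension weights in [0, 1]
    return alpha * Z + (1.0 - alpha) * M

d, dz, dm = 50, 50, 64                                 # fusion, latent, structure dims
rng = np.random.default_rng(0)
F = fuse(rng.random(dz), rng.random(dm),
         rng.normal(scale=0.1, size=(d, dz)),
         rng.normal(scale=0.1, size=(d, dm)),
         rng.normal(scale=0.1, size=(d, dz + dm)))
```

Because the two weights sum to one in every dimension, the module interpolates between Z and M per feature dimension rather than simply adding them with equal importance.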

Loss Function
Following the feature extraction stage and the fusion module, the fused feature F is fed into a softmax layer to predict the classification results {ŷ_n}_{n=1}^{N}, which can be formulated as follows:

ŷ_n = f_c(F_n; ψ_c), (17)

where f_c represents a usual softmax classifier with parameter ψ_c. The supervised constraint ensures that the predicted label ŷ_n is close to the true label y_n via the cross-entropy loss function, as follows:

L_label = −(1/N) ∑_{n=1}^{N} ∑_{k=1}^{K} y_{n,k} log ŷ_{n,k}, (18)

where K represents the number of classes. Therefore, the total loss function of the proposed deep fusion network for target recognition is a combination of L_label and L_VAE (described in Equation (10)), as follows:

L_total = L_label + L_VAE. (19)
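The softmax prediction and cross-entropy supervision described above can be sketched as follows (toy logits and one-hot labels, with a small constant added inside the log for numerical safety):

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax with max-subtraction for numerical stability."""
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(y_onehot, y_pred):
    """Mean cross-entropy over N samples and K classes."""
    return -np.mean(np.sum(y_onehot * np.log(y_pred + 1e-12), axis=1))

logits = np.array([[4.0, 0.0, 0.0], [0.0, 5.0, 0.0]])  # toy classifier outputs
y = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])       # one-hot true labels
loss = cross_entropy(y, softmax(logits))               # small: predictions match labels
```

The loss shrinks toward zero as the softmax probability of the true class approaches one, which is exactly the behavior the supervised constraint enforces.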

Training Procedure
Based on the total loss function L_total, the backpropagation algorithm with stochastic gradient descent (SGD) was used to optimize the proposed network end-to-end and jointly. The total training procedure of the proposed network is outlined in Algorithm 1; its key per-iteration steps include:

6. Sample random noise {ε_n}, n = 1, …, N_1, from a standard Gaussian distribution for re-parameterization.
7. With the HRRP batch x_AP^b as input, generate the latent distribution representation z using Equations (2) and (3), and then generate the reconstruction x̂_AP based on Equation (4).
8. With the SAR image batch x_I^b as input, generate the structure information m using Equation (11).
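The re-parameterization in steps 6 and 7 can be sketched as below. This is a generic illustration of the standard VAE re-parameterization trick (z = μ + σ·ε with ε ~ N(0, I)), with toy μ and log-variance values rather than the network's actual encoder outputs.

```python
import numpy as np

def reparameterize(mu, log_var, rng):
    """Re-parameterization trick: draw eps ~ N(0, I) (step 6) and form
    z = mu + sigma * eps (step 7), so the sampling of the latent
    distribution stays differentiable with respect to mu and log_var."""
    eps = rng.standard_normal(mu.shape)        # step 6: random noise
    return mu + np.exp(0.5 * log_var) * eps    # step 7: latent sample z

# toy latent dimension of 8 (hypothetical)
rng = np.random.default_rng(42)
mu = np.zeros(8)
log_var = np.zeros(8)   # unit variance
z = reparameterize(mu, log_var, rng)
```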

Experimental Data Description
Experiments were carried out based on the HRRP data and SAR images of the moving and stationary target acquisition and recognition (MSTAR) dataset, which was collected with an HH-polarization SAR sensor working in the X-band with 0.3 m × 0.3 m resolution in spotlight mode [56]. The MSTAR dataset, a measured benchmark dataset, is widely used for evaluating SAR target recognition performance. It includes ten different ground military targets, i.e., BMP2 (tank), BTR70 (armored vehicle), T72 (tank), BTR60 (armored vehicle), 2S1 (cannon), BRDM (truck), D7 (bulldozer), T62 (tank), ZIL131 (truck) and ZSU234 (cannon). Among them, BMP2 and T72 have variants in the test stage. The depression angles of the samples for each target category are 15° and 17°, and the aspect angles cover the range from 0° to 360°. Referring to the existing literature, this paper focuses on two experimental scenarios: three-target and ten-target SAR target recognition. The specific experimental data settings for these two scenarios are listed in Tables 2 and 3, respectively. Optical image examples of the ten different targets are shown in Figure 5, and the corresponding SAR image examples are listed in Figure 6. For the MSTAR data, we used the complex-valued SAR images provided by the U.S. Defense Advanced Research Projects Agency and the U.S. Air Force Research Laboratory to recover the high-range-resolution radar echoes in reverse without information loss, in accordance with reference [11]. Then, based on these echoes, the HRRP data were generated, as shown in Figure 1. The average profile examples of the generated HRRP data of the ten targets are listed in Figure 6. The real-valued SAR images were obtained directly by a modulo operation on the complex-valued SAR images.

Evaluation Criteria
For the quantitative analysis, we adopt two widely used criteria, the overall accuracy and the average accuracy, to evaluate the target recognition performance:

Overall accuracy = (Σ_i Tr_i) / (Σ_i Q_i), (19)

Average accuracy = (1/N_C) Σ_i (Tr_i / Q_i), (20)

where Tr_i represents the number of test samples recognized correctly in class i, Q_i represents the total number of test samples in class i, and N_C represents the number of classes.
The higher the values of the overall accuracy and the average accuracy, the better the performance of the target recognition method.
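The two criteria can be computed directly from per-class counts, as in this short sketch with made-up counts for three classes:

```python
import numpy as np

def overall_accuracy(tr, q):
    """Overall accuracy: total correctly recognized over total test samples."""
    return np.sum(tr) / np.sum(q)

def average_accuracy(tr, q):
    """Average accuracy: mean of the per-class accuracies Tr_i / Q_i."""
    return np.mean(np.asarray(tr) / np.asarray(q))

# toy counts for N_C = 3 classes (hypothetical values)
tr = [98, 95, 90]    # correctly recognized per class (Tr_i)
q = [100, 100, 100]  # total test samples per class (Q_i)
oa = overall_accuracy(tr, q)
aa = average_accuracy(tr, q)
```

Note that when every class has the same number of test samples, as in the civilian vehicle experiments later, the two criteria coincide.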

Three-Target MSTAR Data Experiments
In this section, we discuss the effectiveness of the proposed method on the three-target MSTAR data. Table 4 gives the confusion matrix of the proposed deep fusion network on the three-target MSTAR data. The confusion matrix is a widely used performance evaluation tool for target recognition: each row represents the actual category, each column represents the predicted category, and the elements denote the probabilities that targets are recognized as a certain class; in particular, the elements on the diagonal represent the recognition accuracies. From Table 4, it is easy to see that the accuracy was 0.9898 on BTR70, 0.9880 on T72 and 0.9642 on BMP2, which shows that the proposed method achieves good recognition performance. To further validate the efficiency of the proposed method, we compared it with some traditional SAR target recognition methods, i.e., directly applying the amplitude feature of the original SAR images, principal component analysis (PCA), the template matching method, dictionary learning and JDSR (DL-JDSR) [57], sparse representation in the frequency domain (SRC-FT) [58], and Riemannian manifolds [59]. Moreover, the proposed method was compared with other deep learning-based target recognition methods without data augmentation, as seen in Table 5. The compared deep learning-based methods include the original auto-encoder (AE), denoising AE (DAE), linear SVM, the Euclidean distance restricted AE [3] (Euclidean-AE), the VGG convolutional neural network (VGGNet), A-ConvNets [60], the early feature fusion of a model-based geometric hashing (MBGH) approach and a CNN approach (MBGH+CNN with EFF) [61], the compact convolutional autoencoder (CCAE) [62], ResNet-18 [63], ResNet-34 [63] and DenseNet [64]. Figure 7 shows the intuitive accuracy results of the proposed method and the above-mentioned compared methods.
Table 5 lists the detailed per-class accuracies on the three-target MSTAR data together with the overall and average accuracies. From Figure 7 we can clearly see that, compared with the original image, PCA, template matching, DL-JDSR, SRC-FT and Riemannian manifold methods, our proposed method performs better on both overall accuracy and average accuracy. The proposed method also yields higher overall and average accuracy than the compared deep learning methods, i.e., AE, DAE, Euclidean-AE, VGGNet, A-ConvNet, MBGH+CNN with EFF, ResNet-18, ResNet-34 and DenseNet. As shown in Table 5, for the BMP2 and T72 types, which have variants in the test stage, the accuracies of the proposed method reached 0.9642 and 0.9880, respectively, outperforming all other compared target recognition methods. For the BTR70 type, which does not contain variants, the template matching method and VGGNet correctly recognized all test samples; at the same time, the accuracies of the proposed method, DL-JDSR and A-ConvNet were 0.9898, which is very close to 1. In terms of the two comprehensive evaluation criteria in Table 5, the overall accuracy and the average accuracy, the proposed method is at least 0.96% and 0.52% higher, respectively, than the other compared methods.

Ten-Target MSTAR Data Experiments
In this section, we evaluate the target recognition performance of the proposed method on the ten-target MSTAR data. Similar to Section III-C, the confusion matrix is shown first, in Table 6. From Table 6 we can see that the accuracy for all target types except T72 was over 0.97. The best accuracy was achieved on ZIL131, where all test samples were correctly classified; the accuracies on BTR60, D7 and ZSU234 were close to 1, and the worst accuracy was still over 0.94. As shown in Figure 8 and Table 7, our proposed method outperforms all the other compared methods. In particular, for nine of the ten types, i.e., BMP2, BTR70, T72, BTR60, 2S1, BRDM, D7, T62 and ZIL131, the proposed method yielded the highest accuracy. For the ZSU234 type, the A-ConvNet method had the highest accuracy, with the proposed method following closely at an accuracy of 0.9964. In terms of overall accuracy, the proposed method is at least 4% higher than the other compared methods, and it is about 4% higher in terms of average accuracy as well. To gain a better understanding of the network's behavior and to show that the fusion of HRRP data and SAR images is beneficial for SAR target recognition, an ablation study is commonly adopted, in which one or more components of the network are removed or replaced to see how each component affects performance. Therefore, in this sub-section, several controlled experiments were designed; except for the examined components, all other settings remained consistent. The ablation study results on the three-target MSTAR data are summarized in Table 8, where addition denotes the element-wise addition fusion operation and concatenation denotes the concatenation fusion operation, both of which are commonly adopted as fusion modules in multi-stream network architectures [66,67].
From rows 1 and 2 in Table 8, it can be observed that the recognition accuracy using only the HRRP data through the VAE model was 0.8813 for overall accuracy and 0.8399 for average accuracy, while the recognition accuracy using only the SAR images through LightNet was 0.9487 and 0.9612, respectively. The VAE model and LightNet extract different features from different domains, and both achieve good recognition performance. Nevertheless, comparing rows 1 and 2 with rows 3, 4, 5 and 6, it can be observed that fusing the latent feature of the HRRP data obtained from the VAE with the structure features of the SAR images obtained from LightNet reflects the target information more comprehensively and sufficiently, achieving better recognition performance. Furthermore, as shown in rows 3, 4, 5 and 6, on the basis of fusing the VAE and LightNet, the performance improvements brought by the different fusion modules differed. The decision-level fusion module had a 0.9278 overall accuracy and a 0.9357 average accuracy, lower than the accuracy of using LightNet alone; indeed, simple decision-level fusion can yield robust performance but finds it difficult to obtain the best performance. The element-wise addition module had a 0.9568 overall accuracy and a 0.9664 average accuracy, the concatenation module had a 0.9648 overall accuracy and a 0.9715 average accuracy, and the proposed fusion module produced a markedly superior recognition accuracy of 0.9780 overall and 0.9807 average. From this comparison we can see that the proposed fusion module achieved the best fusion performance.

Feature Analysis
The quantitative performance has been evaluated through comparisons with existing methods and detailed ablation studies, revealing the effectiveness of the proposed method. In this sub-section, we adopt t-SNE [68] to visualize, on the three-target data, the fusion feature learned by the proposed method, the features learned by the VAE model and LightNet, and the amplitude feature of the original SAR images, as shown in Figure 9. From Figure 9, it can be observed that the features learned by the proposed fusion network show a better feature distribution, in which each class gathers more closely and the margin between classes is much more distinct compared with the other features.
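A t-SNE visualization of this kind can be reproduced along the following lines. This sketch uses scikit-learn's `TSNE` on randomly generated stand-in "features" (three synthetic clusters); in the actual experiment the inputs would be the fusion features, the VAE features, the LightNet features and the raw amplitude features.

```python
import numpy as np
from sklearn.manifold import TSNE

# stand-in feature matrix: 3 classes x 30 samples x 16-D features (hypothetical)
rng = np.random.default_rng(0)
features = np.vstack(
    [rng.standard_normal((30, 16)) + 5.0 * c for c in range(3)]
)
labels = np.repeat(np.arange(3), 30)

# embed into 2-D for plotting; perplexity must be smaller than the sample count
emb = TSNE(n_components=2, perplexity=20, random_state=0).fit_transform(features)
# emb[:, 0] and emb[:, 1] can then be scattered, colored by `labels`
```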

FLOPs Analysis
In Table 9, we give the number of floating point operations (FLOPs) for the VAE branch, the LightNet branch, the proposed network, and the VGG network for comparison.

Network: VAE branch | LightNet branch | Proposed network | VGG | ResNet-18 | ResNet-34 | DenseNet
FLOPs: 3.4 × 10^5 | 2.3 × 10^7 | 2.33525 × 10^7 | 5.14 × 10^9 | 1.9 × 10^9 | 3.6 × 10^9 | 5.7 × 10^9

By analyzing the calculation principle of the convolutional layer, the computational complexity of one convolutional layer is C_in,cl × C_out,cl × K_cl² × M_out,cl², where C_in,cl and C_out,cl are the numbers of channels in the input and output feature maps of the convolutional layer, K_cl is the size of the convolution kernel, and M_out,cl is the size of the output feature map. For one fully connected layer, the computational complexity is N_in,fl × N_out,fl, where N_in,fl is the number of input nodes of the fully connected layer and N_out,fl is the number of output nodes. Therefore, according to the architecture and details of the LightNet branch shown in Table 1, we obtain the FLOPs for the LightNet branch as 2.3 × 10^7 by substituting the relevant parameters into the complexity formula. Similarly, according to the introduction of the VAE and the details of its architecture presented in Section II-B, the FLOPs for the VAE branch are 3.4 × 10^5. In the proposed network, besides the LightNet branch and the VAE branch, there is a fusion module with 1.25 × 10^4 FLOPs, so the total FLOPs of the proposed network are 2.33525 × 10^7. Substituting the relevant parameters of the VGG network, ResNet-18, ResNet-34 and DenseNet gives FLOPs of 5.14 × 10^9, 1.9 × 10^9, 3.6 × 10^9 and 5.7 × 10^9, respectively.
It can be seen from Table 9 that although the VGG network, ResNet-18, ResNet-34 and DenseNet have deeper architectures and require more FLOPs, the recognition performance of these methods on all datasets was lower than that of the proposed method.
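The two per-layer complexity formulas above translate directly into code. The example layer sizes below are hypothetical, not the actual LightNet or VAE configuration:

```python
def conv_flops(c_in, c_out, k, m_out):
    """FLOPs of one convolutional layer: C_in * C_out * K^2 * M_out^2."""
    return c_in * c_out * k ** 2 * m_out ** 2

def fc_flops(n_in, n_out):
    """FLOPs of one fully connected layer: N_in * N_out."""
    return n_in * n_out

# hypothetical example layers: a 16->32 channel conv with a 3x3 kernel and a
# 32x32 output map, and a 512->128 fully connected layer
example_conv = conv_flops(16, 32, 3, 32)
example_fc = fc_flops(512, 128)
# summing such terms over every layer of a branch yields its total FLOPs
```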

Experiments on Civilian Vehicle Dataset
The civilian vehicle dataset was provided by the U.S. Air Force Research Laboratory. The sensor collecting the data is a high-resolution circular SAR operating in the X-band. The dataset includes ten different civilian vehicle targets, i.e., Toyota Camry, Honda Civic 4dr, 1993 Jeep, 1999 Jeep, Nissan Maxima, Mazda MPV, Mitsubishi, Nissan Sentra, Toyota Avalon and Toyota Tacoma. The aspect angles cover from 0° to 360°, and the depression angle of the samples for each target category is 30°. The HH channel was used for training and the VV channel for testing, with 360 training and 360 test samples in each category. Importantly, the provided data are high-range-resolution radar echoes; for the proposed method, the HRRP data and real-valued SAR images were obtained according to the procedure shown in Figure 1.
We compared the performance of the proposed method with several SAR target recognition methods, including directly applying a linear SVM to the original SAR images, PCA followed by a linear SVM, the template matching method, DL-JDSR, AE, DAE, the VGG network, A-ConvNet, MBGH+CNN with EFF, ResNet-18, ResNet-34 and DenseNet, in Figure 10 and Table 10. Here, because the number of test samples in each category is the same, the overall accuracy and the average accuracy coincide, as can be seen from Equations (19) and (20); therefore, only the total accuracy is listed in Table 10. As shown in Figure 10 and Table 10, our proposed method outperforms all the other compared methods. In particular, for the 1993 Jeep, 1999 Jeep and Toyota Avalon, the proposed method yielded the highest accuracy; for the other categories, the accuracy of our method was not the highest but was still among the best. In terms of total accuracy, the proposed method was at least 2.16% higher than the other compared methods.

Conclusions
In this paper, considering that the SAR image and the corresponding HRRP data, whose information content is not exactly the same, can be obtained simultaneously in the SAR imaging procedure, we formulated a novel end-to-end two-stream fusion network to fuse the characteristics obtained from modeling HRRP data and SAR images for radar target recognition. The proposed network contains two separate streams in the feature extraction stage: one takes advantage of a VAE network to acquire the latent probabilistic distribution from the HRRP data, and the other uses LightNet to extract 2D visual structure information from the SAR images. The proposed fusion module integrates these two types of characteristics to reflect the target information more comprehensively and sufficiently, and it merges the two streams into a unified framework with end-to-end joint training. The experimental results on the MSTAR dataset and the civilian vehicle dataset show that the proposed two-stream fusion method has clear performance advantages over both conventional target recognition methods and other deep learning-based methods, demonstrating its superiority.
Although the proposed target recognition method offers a significant improvement in performance, it has a limitation in speed: since it contains two branches, its running time on one test sample is slightly higher than that of a single branch. In the future, we will further explore speed improvements through parallel computing and algorithm optimization.

Conflicts of Interest:
The authors declare no conflict of interest.