Different real-valued representations are chosen as the input for representative deep learning models to assess their impact on the performance of CNN-based PolSAR image segmentation. In the following section, the selected real-valued representations that have been used as input for CNN models in previous studies, along with their categorization, are presented. Subsequently, two specific CNN architectures are introduced that are employed in this study for PolSAR image segmentation.
2.1. Real-Valued PolSAR Data Representations
In general, the signal measured by a PolSAR system is represented by the complex-valued scattering matrix $\mathbf{S}$ that describes the transformation, induced by observed scatterers, of the transmitted plane wave vector $\mathbf{E}^{t}$ into the received plane wave vector $\mathbf{E}^{r}$:

$$\mathbf{E}^{r} = \mathbf{S}\,\mathbf{E}^{t}, \qquad \mathbf{S} = \begin{bmatrix} S_{HH} & S_{HV} \\ S_{VH} & S_{VV} \end{bmatrix}$$
According to the Backscattering Alignment convention, the backscattered signal is measured in the same polarization plane as the transmitted signal. Following this convention and assuming the commonly used monostatic configuration, where the transmitter and receiver share the same antenna, the reciprocity theorem holds. This theorem dictates that for reciprocal scattering media, the equality $S_{HV} = S_{VH}$ holds true. The scattering matrix provides a complete characterization of deterministic (point-like) scatterers. However, in observing natural environments, multiple scatterers typically contribute to a single resolution cell, resulting in distributed scatterers. To describe this type of scattering behavior, a statistical formalism is required, which is represented by the averaged coherency matrix $\langle \mathbf{T} \rangle$. This matrix is obtained by spatial or temporal averaging of the outer product of the scattering vector $\mathbf{k}$, which is derived from the vectorization of the scattering matrix $\mathbf{S}$ using the Pauli basis:

$$\mathbf{k} = \frac{1}{\sqrt{2}} \begin{bmatrix} S_{HH} + S_{VV} \\ S_{HH} - S_{VV} \\ 2\,S_{HV} \end{bmatrix}, \qquad \langle \mathbf{T} \rangle = \left\langle \mathbf{k}\,\mathbf{k}^{*T} \right\rangle$$
Here $(\cdot)^{*}$ denotes the complex conjugation and $(\cdot)^{T}$ the matrix transpose. $\langle \mathbf{T} \rangle$ is a Hermitian positive semi-definite matrix with real-valued power elements on the diagonal and complex-valued cross-correlations in the upper and lower triangle. Thus, the matrix is defined by nine real-valued parameters. $\langle \mathbf{T} \rangle$ is widely used as a comprehensive descriptor of distributed scattering phenomena, making it a common starting point for PolSAR image segmentation. However, to utilize real-valued CNNs for this task, the complex-valued matrix $\langle \mathbf{T} \rangle$ associated with each pixel needs to be transformed into a real-valued representation. This transformation can be achieved by concatenating the nine independent real-valued parameters of $\langle \mathbf{T} \rangle$, combining the entries into a feature vector, extracting physically interpretable features based on target decomposition methods, or employing a combination of these approaches.
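As a minimal numpy sketch (function names and the toy input are ours, not from the cited works), the Pauli scattering vector and the averaged coherency matrix described above can be computed as:

```python
import numpy as np

def pauli_vector(S):
    """Pauli scattering vector k from a 2x2 complex scattering matrix S."""
    return np.array([S[0, 0] + S[1, 1],
                     S[0, 0] - S[1, 1],
                     2 * S[0, 1]]) / np.sqrt(2)

def coherency_matrix(S_samples):
    """Averaged coherency matrix <T> = <k k^{*T}> over a sequence of looks."""
    ks = [pauli_vector(S) for S in S_samples]
    return np.mean([np.outer(k, k.conj()) for k in ks], axis=0)

# toy example: a single deterministic scatterer with S equal to the identity
S = np.eye(2, dtype=complex)
T = coherency_matrix([S])
```

The resulting matrix is Hermitian with real-valued powers on the diagonal, matching the nine-parameter description above.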
Table 1 provides an overview of various real-valued PolSAR data representations used as input for CNN models in the existing literature.
The first category, labeled as $\langle \mathbf{T} \rangle$-elements, consists of representations derived directly from the coherency matrix $\langle \mathbf{T} \rangle$. The most frequently utilized representation is called T9_real_imag, which is obtained by simply concatenating the real-valued diagonal elements and the real and imaginary parts of the upper-triangle off-diagonal elements. In contrast, the equally computationally straightforward T9_amp_pha representation, which incorporates the amplitude and phase information of the complex-valued elements, is less commonly used. This is despite the findings of [47], suggesting the potential of this representation for improved suitability in CNN-based segmentation. To capture the significance of polarimetric phase differences, in this study the T9_amp_pha representation is contrasted with T9_amp. The latter representation, used as CNN input in [58], solely considers the amplitudes of the complex-valued elements.
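The three $\langle \mathbf{T} \rangle$-element variants can be sketched as follows (a hedged illustration: the function name, channel ordering, and the reading of T9_amp as six channels, i.e. diagonal powers plus off-diagonal amplitudes, are our assumptions and may differ in detail from [58]):

```python
import numpy as np

def t9_channels(T, mode="real_imag"):
    """Real-valued channels from a 3x3 complex coherency matrix T.

    mode: 'real_imag' -> diagonal + Re/Im of upper off-diagonals (T9_real_imag)
          'amp_pha'   -> diagonal + amplitude/phase of off-diagonals (T9_amp_pha)
          'amp'       -> diagonal + off-diagonal amplitudes only (T9_amp)
    """
    diag = [T[0, 0].real, T[1, 1].real, T[2, 2].real]
    off = [T[0, 1], T[0, 2], T[1, 2]]
    if mode == "real_imag":
        return np.array(diag + [v.real for v in off] + [v.imag for v in off])
    if mode == "amp_pha":
        return np.array(diag + [abs(v) for v in off] + [np.angle(v) for v in off])
    if mode == "amp":
        return np.array(diag + [abs(v) for v in off])
    raise ValueError(mode)

# toy Hermitian coherency matrix
T = np.array([[2.0, 1 + 1j, 0.5j],
              [1 - 1j, 1.0, 0.2],
              [-0.5j, 0.2, 0.5]])
v = t9_channels(T, "real_imag")
```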
The second category, labeled as $\langle \mathbf{T} \rangle$-feature vector, is represented by the six-dimensional PolSAR data representation proposed in [20]. It has successfully been used as an input for CNN models in several works [48,49,50,51,52,53]. The feature vector is composed of the total scattering power of all polarimetric channels in logarithmic scaling (SPAN), normalized power ratios ($T_{22}/\mathrm{SPAN}$ and $T_{33}/\mathrm{SPAN}$) and relative correlation coefficients:

$$\mathbf{v} = \left[\, 10\log_{10}(\mathrm{SPAN}),\ \frac{T_{22}}{\mathrm{SPAN}},\ \frac{T_{33}}{\mathrm{SPAN}},\ \frac{|T_{12}|}{\sqrt{T_{11}T_{22}}},\ \frac{|T_{13}|}{\sqrt{T_{11}T_{33}}},\ \frac{|T_{23}|}{\sqrt{T_{22}T_{33}}} \,\right]^{T}, \qquad \mathrm{SPAN} = T_{11} + T_{22} + T_{33}$$

According to [20], this representation is tailored explicitly for neural networks due to the constrained value range of power ratios and correlation coefficients. However, the superiority of this representation over the previously mentioned ones has not yet been analyzed.
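A minimal sketch of the six-dimensional feature vector described above (the function name and the small `eps` guard against division by zero are our additions, not part of [20]):

```python
import numpy as np

def zhou_features(T, eps=1e-10):
    """Six-dimensional real-valued feature vector from a 3x3 coherency matrix T:
    log-scaled total power, two normalized power ratios, and three relative
    correlation coefficients."""
    t11, t22, t33 = T[0, 0].real, T[1, 1].real, T[2, 2].real
    span = t11 + t22 + t33
    return np.array([
        10 * np.log10(span + eps),                # total power in dB
        t22 / (span + eps),                       # normalized power ratio
        t33 / (span + eps),                       # normalized power ratio
        abs(T[0, 1]) / np.sqrt(t11 * t22 + eps),  # relative correlation
        abs(T[0, 2]) / np.sqrt(t11 * t33 + eps),
        abs(T[1, 2]) / np.sqrt(t22 * t33 + eps),
    ])

f = zhou_features(np.diag([1.0, 1.0, 1.0]).astype(complex))
```

By construction, the ratio and correlation components lie in [0, 1], which illustrates the constrained value range mentioned in the text.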
The third category, labeled as Target decomposition, comprises representations based on physically interpretable polarimetric features extracted using coherent or incoherent target decomposition approaches. The Pauli representation is based on the coherent decomposition of the scattering matrix $\mathbf{S}$ into matrices that correspond to surface ($a$), double-bounce ($b$), or volume ($c$) scattering mechanisms:

$$\mathbf{S} = \frac{a}{\sqrt{2}} \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} + \frac{b}{\sqrt{2}} \begin{bmatrix} 1 & 0 \\ 0 & -1 \end{bmatrix} + \frac{c}{\sqrt{2}} \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix}$$

with $a = (S_{HH} + S_{VV})/\sqrt{2}$, $b = (S_{HH} - S_{VV})/\sqrt{2}$ and $c = \sqrt{2}\,S_{HV}$. The intensities $|a|^2$, $|b|^2$, and $|c|^2$ quantify the power scattered by the associated scattering mechanism. The Pauli decomposition is widely used to represent and visualize polarimetric information as a color image and is employed as CNN input in [31]. The analysis of the scattering matrix $\mathbf{S}$, which is only capable of characterizing deterministic scatterers, is insufficient to describe distributed scatterers accurately. Therefore, a second-order descriptor, such as the coherency matrix, has to be considered. Model-based incoherent decomposition methods, such as the four-component Yamaguchi decomposition proposed in [10] and the three-component decomposition proposed in [60], are commonly employed to represent the coherency matrix as a combination of matrices corresponding to elementary scattering mechanisms. In this study, the resulting proportions of surface, double-bounce, and volume scattering obtained from these methods are combined into three-channel images, serving as input for CNNs. Further physically interpretable features that are frequently employed for land cover classification include the entropy $H$, which measures the degree of randomness in the scattering process, the mean alpha angle $\bar{\alpha}$, which relates to the dominant scattering mechanism, and the anisotropy $A$. These features are based on the eigenvalue decomposition of the coherency matrix proposed in [8]. An attempt to enhance CNN-based PolSAR image segmentation using these features is described in [54].
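The Pauli intensities described above can be sketched directly from a reciprocal scattering matrix (the function name and toy input are ours):

```python
import numpy as np

def pauli_intensities(S):
    """Pauli intensities |a|^2 (surface), |b|^2 (double bounce), |c|^2 (volume)
    from a 2x2 reciprocal scattering matrix S (S_hv == S_vh)."""
    a = (S[0, 0] + S[1, 1]) / np.sqrt(2)
    b = (S[0, 0] - S[1, 1]) / np.sqrt(2)
    c = np.sqrt(2) * S[0, 1]
    return abs(a) ** 2, abs(b) ** 2, abs(c) ** 2

# toy example: odd-bounce scatterer (S = identity) -> all power in |a|^2
p_s, p_d, p_v = pauli_intensities(np.eye(2, dtype=complex))
```

In color visualizations, $|b|^2$, $|c|^2$ and $|a|^2$ are often mapped to the red, green and blue channels, respectively.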
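The eigenvalue-based features $H$, $A$ and $\bar{\alpha}$ can likewise be sketched in numpy (the eigenvalue clamping and `eps` guards are our implementation choices, not prescribed by [8]):

```python
import numpy as np

def h_a_alpha(T, eps=1e-12):
    """Entropy H, anisotropy A and mean alpha angle (radians) from the
    eigendecomposition of a 3x3 Hermitian coherency matrix T."""
    w, v = np.linalg.eigh(T)                 # eigenvalues in ascending order
    w = np.clip(w, 0, None)[::-1]            # descending, clamp small negatives
    v = v[:, ::-1]
    p = w / (w.sum() + eps)                  # pseudo-probabilities
    H = -np.sum(p * np.log(p + eps)) / np.log(3)       # entropy, log base 3
    A = (w[1] - w[2]) / (w[1] + w[2] + eps)            # anisotropy
    alphas = np.arccos(np.clip(np.abs(v[0, :]), 0, 1)) # per-eigenvector alpha
    return H, A, np.sum(p * alphas)          # mean alpha: probability-weighted

# toy example: a single dominant surface-like mechanism -> H ~ 0, alpha ~ 0
H, A, alpha = h_a_alpha(np.diag([1.0, 0.0, 0.0]).astype(complex))
```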
Since representing PolSAR data with only three polarimetric features causes a loss of information, several researchers propose to combine multiple features to obtain a more comprehensive data description. These types of representations are assigned to the last category Combination in Table 1. To improve the performance of CNN-based segmentation, Gao et al. [61] propose a dual-branch network that combines the six-dimensional feature vector (Zhou) with Pauli decomposition parameters. Another combination, proposed in [62], is composed of the amplitudes of the coherency matrix elements and the scattering mechanism contributions according to the Yamaguchi decomposition. Chen and Tao [21] assemble the well-known features $H$, $A$, $\bar{\alpha}$ and the total scattering power (SPAN) with so-called null angle features, which describe the target orientation diversity; the definition of the two included null angle features is given in [21]. In [21], the ChenTao representation is compared to the T9_real_imag representation as an input for a patch-based CNN classification, where it achieves slightly higher accuracies on two PolSAR datasets. To specifically investigate the usefulness of the two null angle parameters, the analysis presented here also includes a comparison to the representation based on the feature subset
H_A_ᾱ_span. A further approach to combine polarimetric features within a CNN-based segmentation is proposed by Qin et al. in [56]. Their analysis identifies a suitable CNN input set of 16 components, which includes elements of $\langle \mathbf{T} \rangle$, the smallest eigenvalue $\lambda_{3}$ of $\langle \mathbf{T} \rangle$, $H$, $A$ and $\bar{\alpha}$ as well as SPAN. Following this, the present work includes the PolSAR data representation referred to as Qin in the investigation. Finally, a representation denoted as Mix is evaluated, which combines the ChenTao and Qin representations with the power of elementary scattering mechanisms based on the Yamaguchi decomposition. The Mix representation consists of 23 components, making it the most extensive representation in this comparison.

Some of the extracted features, such as the amplitudes of the elements of $\langle \mathbf{T} \rangle$ and the scattering power contributions, exhibit distributions that deviate significantly from a normal distribution and possess a high dynamic range. These characteristics can potentially have a detrimental effect on the accuracy of segmentation, considering that CNNs are optimized for processing normally distributed RGB images. To mitigate this issue, the affected features are logarithmically scaled to approximate a normal distribution and reduce the dynamic range. Another crucial step in preprocessing is the standardization of features. This process converts the various feature values into a common unit, making them comparable. In image processing, z-standardization is typically employed, which involves subtracting the mean and dividing by the standard deviation. However, it is important to consider that the extracted PolSAR features may contain outliers, such as unusually high backscatter values caused by artificial structures. These outliers can greatly influence the mean and standard deviation, making z-standardization inadequate for achieving balanced feature scales. To address this issue, a robust standardization method based on the median and quantile range is employed:

$$\tilde{x} = \frac{x - \mathrm{median}(X_c)}{Q_{98}(X_c) - Q_{2}(X_c)}$$
where $X_c$ denotes the image data of component $c$, $x$ denotes the value of one pixel in $X_c$, and $Q_{98}(X_c)$ and $Q_{2}(X_c)$ denote the 98th and 2nd percentiles of $X_c$, respectively.
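The robust standardization step can be sketched in a few lines of numpy (the function name is ours; percentile interpolation follows numpy defaults):

```python
import numpy as np

def robust_scale(X, q_low=2, q_high=98):
    """Robust standardization of one image component X:
    subtract the median, divide by the (q_high - q_low) percentile range."""
    lo, hi = np.percentile(X, [q_low, q_high])
    return (X - np.median(X)) / (hi - lo)

# toy component: values 0..100 -> median 50, percentile range 98 - 2 = 96
s = robust_scale(np.arange(101.0))
```

Unlike z-standardization, a few extreme outliers barely move the median or the 2nd/98th percentiles, so the resulting scale stays balanced.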
To form CNN input layers based on the selected real-valued representations, the corresponding scaled components are combined into a multi-channel image $X$, where $X \in \mathbb{R}^{W \times H \times C}$. Here, $W$ and $H$ represent the width and height of the image, and $C$ represents the number of channels corresponding to the number of components in the representation.
It should be noted that concatenating the individual components into a multi-channel image deviates partially from the approach used in the cited works. The choice of the concatenation approach is motivated by its widespread use in the literature and its compatibility with established CNN models. Additionally, it ensures a fair comparison of the representations, as the same CNN architecture with an identical number of trainable parameters can be used for each representation.
2.2. CNN Segmentation Models
To analyze the suitability of the previously described PolSAR representations for CNN-based segmentation, two different CNN models are used, which are presented below. Both models are based on U-Net, an FCN introduced in [24] that consists of an encoding and a decoding path. The general structure of U-Net is visualized at the top of Figure 1. Within the encoder, contextual image features are extracted by the repeated application of convolution, activation, and aggregation. Thereby, the spatial dimension is reduced, while the feature dimension, which encodes the relevant image information, increases. Within the decoder, the resulting feature maps are gradually spatially up-sampled until the original spatial image dimension is reached, and a combination of convolution and softmax activation realizes a pixel-wise classification. To retain fine-scale spatial information, skip connections concatenate feature maps from the encoding path (blue) to up-sampled feature maps in the decoding path (grey).
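The shape bookkeeping of the encoder, decoder, and skip connections can be illustrated with a toy numpy sketch (convolutions and activations are omitted; only pooling, nearest-neighbour upsampling, and channel concatenation are shown):

```python
import numpy as np

def max_pool2(x):
    """2x2 max pooling on a (H, W, C) feature map, halving H and W."""
    h, w, c = x.shape
    return x.reshape(h // 2, 2, w // 2, 2, c).max(axis=(1, 3))

def upsample2(x):
    """Nearest-neighbour 2x upsampling on a (H, W, C) feature map."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

# toy shape flow: the encoder halves the spatial dimensions, the decoder
# restores them, and the skip connection concatenates encoder features
# along the channel axis
enc = np.random.rand(64, 64, 32)              # encoder feature map
bottleneck = max_pool2(enc)                   # (32, 32, 32)
dec = upsample2(bottleneck)                   # (64, 64, 32)
fused = np.concatenate([enc, dec], axis=-1)   # (64, 64, 64)
```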
As an encoder, an arbitrary CNN model can be used for feature extraction. Since the extracted features provide the basis for the final class separation, the choice of the CNN model significantly influences the segmentation result. In this work, two common models, a Residual Network (ResNet) proposed in [64] and an EfficientNet proposed in [65], are used as encoders. Both are visualized in Figure 1. ResNets are specific deep CNNs that have proven to be very powerful for the classification of RGB images. The high performance of this model has been achieved by introducing the concept of residual learning, which enables the training of very deep networks. Instead of direct mapping functions that transform an input into the desired output, residual functions are learned with reference to the layer inputs. This is realized using residual blocks (highlighted in orange in Figure 1). These contain so-called shortcut connections that perform an identity mapping of the input, which is added to the output of subsequent layers. In this work, ResNet-18, whose architecture is detailed in Figure 1, is used as the encoder of the U-Net model.
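The residual principle can be shown schematically in numpy (dense layers stand in for the convolutions of an actual residual block; all names and toy values are ours):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0)

def residual_block(x, W1, W2):
    """Schematic residual block: the layers learn a residual F(x) that is
    added to the identity shortcut, i.e. y = relu(F(x) + x)."""
    fx = relu(x @ W1) @ W2   # residual branch (stand-in for two conv layers)
    return relu(fx + x)      # shortcut connection adds the unchanged input

rng = np.random.default_rng(0)
x = rng.standard_normal(8)
# with a zero-initialised residual branch, the block reduces to relu(x),
# illustrating why deep stacks of such blocks remain easy to train
y = residual_block(x, np.zeros((8, 8)), np.zeros((8, 8)))
```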
EfficientNet was proposed by Google in 2019 [65]. By scaling the network’s depth, width, and resolution in a structured way, good performance can be achieved with low resource consumption. The network is mainly built using mobile inverted bottleneck (MBConv) blocks introduced in [66] that are shown in Figure 1. This building block includes a so-called inverted residual block, which first employs point-wise ($1 \times 1$) convolution to project an input feature map into a higher-dimensional space, subsequently performs depth-wise convolution, and finally projects the resulting feature map back to a lower-dimensional space. The input feature map is added to the output feature map using a residual shortcut connection. The inverted residual block is extended by a Squeeze-and-Excitation (SE) block [67] consisting of a global pooling and two fully connected (FC) layers. This block allows the recalibration of channel-wise feature responses, which enables the network to provide higher weighting to relevant features. In this work, EfficientNet-b0, shown in Figure 1, is used as an encoder of the U-Net model.
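The squeeze-and-excitation mechanism can be sketched in numpy (toy weight matrices and the reduction ratio are our assumptions; a trained network would learn them):

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def se_block(feat, W1, W2):
    """Squeeze-and-Excitation: global average pooling ('squeeze') followed by
    two fully connected layers that output per-channel weights in (0, 1),
    which recalibrate the channels of the feature map."""
    z = feat.mean(axis=(0, 1))   # squeeze: (C,) channel descriptor
    s = np.maximum(z @ W1, 0)    # FC1 + ReLU (channel reduction)
    s = sigmoid(s @ W2)          # FC2 + sigmoid: per-channel weights
    return feat * s              # excite: rescale each channel

feat = np.ones((4, 4, 8))
W1 = np.zeros((8, 2))            # toy weights, reduction 8 -> 2 channels
W2 = np.zeros((2, 8))
out = se_block(feat, W1, W2)     # zero weights -> uniform sigmoid(0) = 0.5
```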
In addition to the presented architectures, many other networks can be used as encoders, such as VGG, Inception, or MobileNet. ResNet-18 and EfficientNet-b0 were chosen as encoders in this work because they offer a good compromise between classification accuracy and the number of trainable parameters.