1. Introduction
When a serious natural disaster strikes, residential buildings are likely to be damaged, which poses a great threat to property and life [1,2]. According to statistics, building collapse is one of the main causes of human casualties after natural disasters [3]. Rapid and accurate building damage assessment prior to rescue actions can support effective emergency rescue planning and save more lives [4], and it is essential for Humanitarian Assistance and Disaster Response (HADR) [5]. Such assessment has become an indispensable reference for rescue actions after natural disasters.
Remote sensing has the advantage of acquiring ground target information over a large area and has been widely used to observe disaster areas. In recent years, with the development of remote sensing technology and satellite constellations, remote sensing data have become more easily accessible when disasters occur. High-resolution optical images, synthetic aperture radar (SAR) and LiDAR data are frequently used in the interpretation of disaster-affected areas [6,7,8,9,10,11], and interpreting such data has become an important way to assess the post-disaster situation [12]. High-resolution optical images are more widely used [13] since the real building conditions can be easily interpreted from them, which provides a powerful source of information for assessing the extent and scope of the damage.
A number of building damage assessment methods based on high-resolution optical remote sensing images have been proposed.
According to whether pre-disaster or post-disaster remote sensing images are used, building damage assessment methods can be divided into the following two categories: (1) methods that use only post-disaster images; and (2) methods that use both pre-disaster and post-disaster images. For post-disaster image-based methods, damage assessment is usually conducted by image segmentation [14]. However, the outlines of buildings in post-disaster images may be blurred by the disaster, and the characteristics of the buildings often change dramatically. Collapsed buildings lose their regular geometric shapes, so the regular texture feature distribution no longer applies, which results in assessment errors. To address these problems, a pre-disaster image of the intact buildings, if available, can help locate the damaged parts of buildings [15]. In [16], a two-stage framework is proposed for damage detection from both pre-disaster and post-disaster images. The networks of the two stages share the same weights and are responsible for building localization and damage classification, respectively. In this paper, we design our network mainly based on this idea.
According to the image processing methods used, building damage assessment methods can be divided into the following three categories: (1) manual visual interpretation, (2) traditional machine learning-based methods, and (3) deep learning-based methods. The manual visual interpretation has good specificity and relatively high accuracy, but it takes a lot of time, and the effectiveness of the interpretation depends on human experience and the time spent; thus, it may lead to missing the best time for rescue. Therefore, it is necessary to automatically perform damage assessment from remote sensing images. Many machine learning algorithms with shallow structures have been developed, such as the methods based on Support Vector Machine (SVM) [
17] and Random Forest (RF) [
18]. These methods have reached relatively high precision with a small number of parameters. Annabella et al. apply an SVM classifier to the changed features of buildings, including color, texture, correlation descriptors, and statistical similarity, to detect destroyed buildings after an earthquake [
19]. However, the manually extracted features from remote sensing images by these methods are not sufficiently general, and in most cases, they are only valid for specific situations. These models are difficult to transfer to other geographic areas [
20]. Moreover, a high level of a priori knowledge, as well as a large amount of time, is required for the feature extraction. Therefore, it is not practical to apply traditional machine learning algorithms to building damage assessment quickly after the disasters strike.
Deep learning has been extensively developed in recent years, and the methods based on Convolutional Neural Network (CNN) have achieved high-level accuracy on various tasks [
21,
22,
23,
24] due to its powerful automatic feature extraction capability. Many researchers have made great progress in building damage assessment when introducing CNN. Koch et al. [
25] propose a metric-based method with siamese CNNs for one-shot classification tasks. An energy function based on a weighted distance between the twin feature vectors is utilized. The parameters tied between the twin networks ensure that the same metric is computed when two distinct images are fed into the network. Liu et al. [
26] use an end-to-end framework to detect damaged buildings from remote sensing images by combining CNN and Recurrent Neural Network (RNN). Nex et al. [
27] use multi-level classification instead of binary classification to assess the extent of damage. In [
28], for an image patch, all buildings at the edges are occluded to allow the model to focus on the central buildings. Zhan et al. [
29] apply U-Net [
30], which was originally proposed for medical image segmentation, to two-phase SAR images for building structure change detection. Based on CNN, U-Net consists of a symmetric encoder–decoder in which down-sampling and up-sampling are introduced. U-Net introduces skip connections that enable the integration of low-level and high-level features, thereby improving the representation of spatial information. With the capability to restore high-resolution information and extract high-level features, different kinds of U-Nets have been widely applied to change detection and damage assessment tasks. Yang et al. [
31] construct the Recurrent-CNN (RCNN) U-Net, which can extract spatial context and exploit rich low-level features. RCNN performs region of interest (ROI) segmentation before feature extraction on the ROI instead of the whole image. To exploit the correlation between pre- and post-disaster images, Xiao et al. [
32] propose a siamese U-Net to process the task of building segmentation and damage classification simultaneously. However, the actual receptive field of the CNN-based network is much smaller than the theoretical receptive field [
33]. In other words, the representation ability of the network is limited. In the process of damage assessment, the limited receptive field constrains the contextual information that the network can utilize, which has a significant impact on the assessment performance.
The Transformer architecture [
34] has the advantage of a global receptive field and has the potential to overcome the above problems of CNNs. It initially achieved great success in Natural Language Processing (NLP). The Vision Transformer (ViT) introduces Multi-head Self-Attention (MSA) into vision tasks [
35]. Recently, ViT and its variants have shown powerful global relational modeling capabilities and outperform the state-of-the-art CNN models in many tasks [
36,
37,
38]. Compared to CNN with a limited receptive field, Transformer keeps the sizes of input and output unchanged and effectively captures global contextual information. Chen et al. [
39] propose a two-branch ViT with skip connections to learn multi-scale features. It proves that learning features from different scales is also effective in vision tasks. Wang et al. [
40] introduce the Pyramid Vision Transformer, a unified Transformer backbone for vision tasks with pixel-level prediction that does not require convolution operations. However, there are two challenges in applying Transformers in the remote sensing domain, i.e., the various scales of ground objects and the extremely high resolution of the images. To limit the high computational complexity of ViT on high-resolution images, Swin Transformer [41] proposes a shifted window mechanism to construct a hierarchical structure and shows great potential in semantic segmentation. The shifted window mechanism significantly reduces the computation cost and makes it possible to process high-resolution images. In the field of medical image segmentation, the Swin Transformer architecture shows great performance [
42], but its potential has not been confirmed in the field of damage assessment.
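To illustrate why windowed attention makes high-resolution inputs tractable, the following sketch evaluates the complexity estimates reported for Swin Transformer [41]: global MSA costs about 4hwC² + 2(hw)²C operations on an h × w token grid with C channels, whereas window-based MSA with window size M costs about 4hwC² + 2M²hwC. The function names and the example sizes (a 256 × 256 token grid, C = 96, M = 7) are illustrative assumptions, not settings reported in this paper.

```python
def msa_flops(h: int, w: int, c: int) -> int:
    """Approximate cost of global multi-head self-attention on an h x w token grid."""
    return 4 * h * w * c ** 2 + 2 * (h * w) ** 2 * c


def window_msa_flops(h: int, w: int, c: int, m: int) -> int:
    """Approximate cost of window-based self-attention with m x m windows."""
    return 4 * h * w * c ** 2 + 2 * m ** 2 * h * w * c


# Illustrative sizes only (assumed, not taken from the paper).
h, w, c, m = 256, 256, 96, 7
ratio = msa_flops(h, w, c) / window_msa_flops(h, w, c, m)
print(f"global / windowed cost ratio: {ratio:.1f}")
```

The second term is quadratic in the number of tokens hw for global attention but linear for windowed attention, which is why the gap widens as the image grows.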
Although the shifted window mechanism of Swin Transformer makes the computational complexity linear with the input size, this strategy weakens the global modeling capability of Transformer to some extent, and additional spatial information is required to compensate for it. Moreover, in building damage assessment based on high-resolution optical remote sensing images, certain classes of damage have highly similar appearances, e.g., no damage and minor damage. Existing methods therefore use attention mechanisms to address these issues. Fu et al. [
43] build long-range correlations through parallel channel attention and position attention. CBAM [
44] constructs spatial-level and channel-level attention for adaptive feature refinement. Chen et al. [
45] propose a feature extractor with a pyramid spatial–temporal attention module for change detection. In remote sensing images, pixel-level spatial correlation should receive more attention to avoid semantic ambiguity due to the occlusion of ground objects [
46]. Therefore, we introduce vertical and horizontal self-attention mechanisms to construct pixel-level spatial correlations.
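The idea of attending along rows and columns can be sketched in plain Python as follows. This is a minimal illustration of horizontal and vertical self-attention under strong simplifying assumptions (identity query/key/value projections, a single head, and a residual sum to combine the two directions); it is not the SF module of this paper, and all function names are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(seq):
    """Scaled dot-product self-attention over a 1D sequence of feature vectors.

    Assumption: queries, keys and values are the inputs themselves.
    """
    d = len(seq[0])
    out = []
    for q in seq:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in seq]
        weights = softmax(scores)
        out.append([sum(wt * v[j] for wt, v in zip(weights, seq)) for j in range(d)])
    return out


def horizontal_attention(fmap):
    """Each pixel attends to all pixels in its own row (fmap is H x W x C)."""
    return [attend(row) for row in fmap]


def vertical_attention(fmap):
    """Each pixel attends to all pixels in its own column."""
    columns = [list(col) for col in zip(*fmap)]
    attended = [attend(col) for col in columns]
    return [list(row) for row in zip(*attended)]


def axial_fusion(fmap):
    """Combine both directions with a residual sum (an assumed fusion rule)."""
    hor = horizontal_attention(fmap)
    ver = vertical_attention(fmap)
    return [
        [[x + a + b for x, a, b in zip(px, ph, pv)] for px, ph, pv in zip(r, rh, rv)]
        for r, rh, rv in zip(fmap, hor, ver)
    ]
```

Attending along one axis at a time costs on the order of HW(H + W) pairwise scores instead of (HW)² for full attention, while still letting every pixel exchange information with its whole row and column.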
In this paper, to address the limitation of CNNs in global relational modeling, we propose, for the first time, a hierarchical Transformer-based two-stage framework for building damage assessment, named SDAFormer. As mentioned earlier, the assessment process is split into two stages to make full use of the semantic information in the pre- and post-disaster images. In Stage 1, a Transformer-based U-Net-like pixel-level segmentation network is used for building localization. Inspired by Residual network [
47] and U-Net [
30], a symmetric encoder–decoder structure with skip connections is constructed as the segmentation network based on Transformer block. Then, the segmentation results are used to guide the building locations for Stage 2. In Stage 2, damage classification is conducted by using a siamese network. The weights trained in Stage 1 are utilized to initialize the network weights of Stage 2 to improve the efficiency of the training process. In addition, a spatial fusion (SF) module is proposed to enable the network to aggregate global features in spatial dimensions. The framework is evaluated on the xBD dataset [
48] and individual disaster datasets.
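The way the Stage 1 output guides Stage 2 can be pictured with a small post-processing sketch. It assumes that Stage 1 yields a binary building mask, that Stage 2 yields four per-pixel damage scores, and that the final map uses 0 for background and 1 to 4 for the damage levels from no damage to destroyed. The function name `fuse_stages` and the data layout are hypothetical and only serve the illustration.

```python
def fuse_stages(building_mask, damage_scores):
    """Keep damage predictions only where Stage 1 found a building.

    building_mask: H x W grid of 0/1 values (assumed Stage 1 output).
    damage_scores: H x W grid of 4-element score lists (assumed Stage 2 output).
    Returns an H x W grid with 0 = background and 1..4 = damage level.
    """
    fused = []
    for mask_row, score_row in zip(building_mask, damage_scores):
        out_row = []
        for is_building, scores in zip(mask_row, score_row):
            if is_building:
                best = max(range(len(scores)), key=scores.__getitem__)
                out_row.append(best + 1)  # shift so that 0 stays reserved for background
            else:
                out_row.append(0)
        fused.append(out_row)
    return fused


mask = [[1, 0]]
scores = [[[0.1, 0.7, 0.1, 0.1], [0.9, 0.0, 0.0, 0.1]]]
print(fuse_stages(mask, scores))  # the second pixel is background regardless of its scores
```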
The main contributions of this paper can be summarized as follows:
(1) We propose a Transformer-based two-stage framework for pre- and post-disaster remote sensing image analysis. Based on the siamese U-Net architecture, a pure Transformer-based encoder–decoder structure is constructed for the building damage assessment task instead of CNN.
(2) To enhance the spatial correlation of global features, a spatial fusion (SF) module is presented. We introduce self-attention in the horizontal and vertical directions to enhance the pixel-level feature representation capability.
The rest of this paper is organized as follows. Section 2 presents the detailed methods. Section 3 presents the data used and the experimental results. Section 4 contains the discussion, and Section 5 draws conclusions.
3. Experiment Results
3.1. Experiment Data
3.1.1. xBD Dataset
In our study, the xBD dataset [
48] is used to validate the performance of the proposed method. The xBD dataset is a large-scale public building segmentation and damage assessment dataset with high-quality building annotations from high-resolution satellite images before and after 19 different natural disasters (e.g., earthquakes, volcanic eruptions, hurricanes, and floods). It is sourced from the Maxar/DigitalGlobe Open Data Program [
52], where high-resolution images from many disparate regions of the world are available. The dataset consists of pairs of pre-disaster and post-disaster 1024 × 1024 satellite images with a ground sample distance (GSD) below 0.8 m. The split of train, validation, and test sets is shown in
Table 1.
The dataset provides 4-level damage labels, including no damage, minor damage, major damage and destroyed. The number of damage annotations of each level is shown in
Table 2. It should be noted that the distribution of each damage level is imbalanced.
3.1.2. Instance Data
Four individual disaster events are used to verify the robustness and transferability of the proposed method. Two of them are tornadoes in the USA, and the other two are from Typhoon Yutu in the Northern Mariana Islands. The detailed information of these disasters is shown in
Table 3.
3.2. Implementation Details
The proposed method is implemented using PyTorch 1.10. The experiments are run on a computer with an Intel Core i7-10700 CPU and an NVIDIA RTX-3090 GPU. Simple data augmentation, including rotation and flip, is used to enhance the diversity of the data. AdamW is used as the optimizer. The learning rate is 0.00015 for Stage 1 (building localization) and 0.0002 for Stage 2 (damage classification). The number of epochs is 120 for Stage 1 and 20 for Stage 2. Weights pre-trained on ImageNet are used for initialization.
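For paired imagery, the same geometric transform has to be applied to the pre-disaster image, the post-disaster image, and the label mask, otherwise the three would no longer be aligned. The following sketch shows one way to do this with plain nested lists. It assumes rotations in multiples of 90° and a horizontal flip; the exact augmentation settings are not stated here, and the helper names are hypothetical.

```python
import random


def rot90(grid):
    """Rotate a 2D grid by 90 degrees clockwise."""
    return [list(row) for row in zip(*grid[::-1])]


def hflip(grid):
    """Mirror a 2D grid left to right."""
    return [row[::-1] for row in grid]


def augment_pair(pre, post, mask, rng):
    """Apply one randomly drawn rotation/flip identically to all three grids."""
    quarter_turns = rng.randrange(4)   # 0, 90, 180 or 270 degrees (assumed choice)
    do_flip = rng.random() < 0.5
    augmented = []
    for grid in (pre, post, mask):
        for _ in range(quarter_turns):
            grid = rot90(grid)
        if do_flip:
            grid = hflip(grid)
        augmented.append(grid)
    return tuple(augmented)


rng = random.Random(0)  # fixed seed so the example is repeatable
pre = [[1, 2], [3, 4]]
post = [[5, 6], [7, 8]]
mask = [[0, 1], [1, 0]]
pre_aug, post_aug, mask_aug = augment_pair(pre, post, mask, rng)
```

Drawing the random parameters once per sample, rather than once per image, is what keeps the pair and its mask pixel-aligned.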
3.3. Loss Function
We adopt binary cross-entropy loss for the building localization loss $L_{loc}$, which is defined as
$$L_{loc} = -\left[ y \log(p) + (1 - y) \log(1 - p) \right],$$
where $p$ and $y$ are the predicted probability of building location and the reference label, respectively. The damage classification network outputs a mask of five channels, including one channel for localization and four for damage levels. To alleviate the imbalance of samples across damage levels, a weighted mixed loss function consisting of focal loss and dice loss is used as the damage classification loss $L_{cls}$, which is formulated as
$$L_{cls} = \sum_{n=1}^{5} w_n \left[ \alpha \, L_{focal}(\hat{m}_n, m_n) + \beta \, L_{dice}(\hat{m}_n, m_n) \right],$$
where $\hat{m}_n$ and $m_n$ are the predicted mask and true mask for channel $n$, respectively, $\alpha$ and $\beta$ are the weights for focal loss and dice loss, respectively, and $w_n$ is the weight for channel $n$. Larger weights are set for the minor damage and major damage classes, which are uncommon, and a smaller weight is accordingly set for localization.
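A minimal sketch of these losses on flattened per-pixel probabilities is given below. It assumes the standard binary forms of cross-entropy, focal loss, and dice loss; the focusing parameter `gamma = 2.0`, the smoothing constant, the default loss weights, and all function names are assumptions for illustration, since the actual weight values are not reproduced here.

```python
import math

EPS = 1e-7  # clamp probabilities away from 0 and 1 so that log() stays finite


def _clamp(p):
    return min(max(p, EPS), 1.0 - EPS)


def bce_loss(probs, labels):
    """Mean binary cross-entropy over pixels (localization loss)."""
    terms = [
        -(y * math.log(_clamp(p)) + (1 - y) * math.log(1.0 - _clamp(p)))
        for p, y in zip(probs, labels)
    ]
    return sum(terms) / len(terms)


def focal_loss(probs, labels, gamma=2.0):
    """Mean binary focal loss; gamma down-weights easy pixels (assumed value)."""
    total = 0.0
    for p, y in zip(probs, labels):
        p_t = _clamp(p if y == 1 else 1.0 - p)  # probability of the true label
        total += -((1.0 - p_t) ** gamma) * math.log(p_t)
    return total / len(probs)


def dice_loss(probs, labels, smooth=1.0):
    """Soft dice loss: 1 minus the smoothed overlap ratio."""
    inter = sum(p * y for p, y in zip(probs, labels))
    return 1.0 - (2.0 * inter + smooth) / (sum(probs) + sum(labels) + smooth)


def mixed_loss(pred_channels, true_channels, channel_weights, w_focal=1.0, w_dice=1.0):
    """Channel-weighted sum of focal and dice losses (classification loss)."""
    return sum(
        w * (w_focal * focal_loss(p, t) + w_dice * dice_loss(p, t))
        for p, t, w in zip(pred_channels, true_channels, channel_weights)
    )
```

In such a scheme, giving the rare channels a larger entry in `channel_weights` makes their mistakes count more, which is the purpose of the class weighting described above.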
3.4. Performance Evaluation Metrics
In segmentation tasks, precision and recall are important accuracy indicators. In most cases, it is difficult to evaluate performance well using only one of them. The F1 score represents the balance between precision and recall and can better reflect the overall performance of the model, especially in the case of unbalanced samples. For a given category, TP (true positive) represents the number of pixels that are correctly predicted as this category, FP (false positive) denotes the number of pixels from other categories that are incorrectly predicted as this category, and FN (false negative) indicates the number of pixels belonging to this category that are incorrectly predicted as other categories. In this paper, the XView2 Challenge metric [48] is used to evaluate the results. The F1 score of the building segmentation ($F1_{loc}$) and the harmonic mean of the F1 scores of the damage classes ($F1_{cls}$) are applied. $F1_{C_i}$ refers to the F1 score of each damage class and $C_i$ represents the $i$-th damage class, where $C_1$ to $C_4$ denote no damage, minor, major, and destroyed, respectively. $F1_{loc}$ and $F1_{cls}$ are defined as
$$F1_{loc} = \frac{2TP}{2TP + FP + FN}, \qquad F1_{cls} = \frac{4}{\sum_{i=1}^{4} 1 / F1_{C_i}}.$$
The final score [48] of overall evaluation comprehensively reflects the building segmentation and damage classification performance, which is formed based on $F1_{loc}$ and $F1_{cls}$:
$$Score = 0.3 \times F1_{loc} + 0.7 \times F1_{cls}.$$
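The metric can be sketched in a few lines of Python. The sketch uses the xView2 weighting of 0.3 for the localization F1 and 0.7 for the damage classification F1; the function names and the example counts are hypothetical.

```python
from typing import Sequence


def f1_score(tp: int, fp: int, fn: int) -> float:
    """Pixel-wise F1 = 2TP / (2TP + FP + FN); returns 0.0 when undefined."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def harmonic_mean(values: Sequence[float]) -> float:
    """Harmonic mean of per-class F1 scores; 0.0 if any class scores 0."""
    if any(v <= 0.0 for v in values):
        return 0.0
    return len(values) / sum(1.0 / v for v in values)


def overall_score(f1_loc: float, class_f1: Sequence[float]) -> float:
    """Overall score: 0.3 * localization F1 + 0.7 * damage classification F1."""
    return 0.3 * f1_loc + 0.7 * harmonic_mean(class_f1)


# Hypothetical pixel counts, for illustration only.
f1_loc = f1_score(tp=8, fp=2, fn=2)
print(overall_score(f1_loc, [0.9, 0.6, 0.7, 0.8]))
```

Because the harmonic mean is dominated by its smallest term, a weak score on a single class (typically minor damage) pulls the classification F1 down far more than an arithmetic mean would.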
3.5. Comparisons with Other Models on xBD Dataset
To verify the effectiveness of the proposed method, we compare it with several existing CNN-based methods:
Weber’s method [
53] utilizes Mask R-CNN with FPN structure and parallel architecture for both building segmentation and damage classification. It concatenates pre-disaster and post-disaster features after feature extraction with ResNet-50. Then, the fused feature map is fed into the segmentation head for damage assessment. A novel loss function is designed to weight the mistakes on levels inversely proportional to their occurrence in the xBD dataset.
In RescueNet [
54], a dilated ResNet-50 is used for the backbone of the U-Net. To utilize the differences between pre-disaster and post-disaster images, both images are fed into the network for building segmentation. Only the post-disaster image is used in the task of damage classification. Different loss functions are applied to the two tasks separately. Specifically, the Binary Cross-Entropy loss is used for building segmentation, while the foreground-only selective Categorical Cross-Entropy loss is used for damage classification. A dual-head framework is developed, which contains a segmentation head and a change detection head.
The approaches of the top two results from the XView2 Challenge are employed for evaluation, including XView2 1st [
55] and XView2 2nd [
56]. XView2 1st builds a multi-model ensemble for better performance. XView2 2nd simultaneously applies DPN-92 and DenseNet-161 to U-Net for damage assessment. Both methods use various techniques including data augmentation and multiple test strategies.
Table 4 shows the damage assessment results of different methods on the xBD dataset. In the overall task, our framework performs better and reaches the score of 80.2%. In comparison with the highest overall score of XView2 1st, our method obtains a 1.5% boost. The
is improved compared with MaskRCNN and RescueNet. The
scores also reach their highest, except that the
is slightly lower than the method of XView2 2nd, which may be due to the dual-model strategy of XView2 2nd.
Figure 6 visualizes the damage assessment results of each method. For RescueNet, more errors appear in the damage classification, especially on the minor damage level. For Mask R-CNN, it can be seen that the model outputs more segmentation mistakes, and more mistakes appear on the edges of damaged areas. It can be seen that our proposed model obtains more accurate damage level prediction with smoother boundaries. Overall, the proposed SDAFormer performs best with fewer assessment mistakes.
Figure 7 visualizes the results of the building segmentation, which is Stage 1 of the proposed framework. The results of
Figure 7c,d mistakenly detect the area of cropland as buildings and cannot provide accurate contours of the buildings in the pre-disaster image. It can be seen that our framework achieves more precise segmentation results in comparison with the other methods. In the results of our method, the contours are clear enough for the siamese network to locate the buildings. Thus, the output of Stage 1 is adequate for locating buildings in Stage 2 where damage classification is to be performed.
3.6. Ablation Study
An ablation experiment is conducted to demonstrate the effectiveness of our proposed method. To investigate the effectiveness of the Transformer-based structure, a CNN-based network using the backbone of Res-50 without additional modules as the baseline is implemented for comparison. To evaluate the contribution of the SF module, a Transformer-based network without the SF module is used.
Table 5 shows that the scores of Transformer-based methods exceed the score of the Res-50-based baseline, which proves the effectiveness of the Transformer-based network.
Furthermore, SDAFormer achieves a clear improvement with the SF module. The damage classification score ($F1_{cls}$) is improved to 77.6%. The result shows that the SF module has little influence on the localization accuracy but yields an obvious gain in classification accuracy. The score of the minor damage class ($F1_{C_2}$) improves notably (from 57.0% to 61.4%), which shows that the spatial attention mechanism helps on a damage level that is difficult to recognize.
To further evaluate the influence of the Transformer structure and SF module, three groups of sample images in the test set are picked out for comparison, as shown in
Figure 8,
Figure 9 and
Figure 10, respectively.
Figure 8 illustrates the performance on the detection of undamaged and minor damaged buildings. It can be seen that SDAFormer has more prediction mistakes on minor damage. There are several reasons for this. Firstly, the training set is imbalanced on the damage level annotations and heavily biased toward the level of no damage. Secondly, the high visual similarity between no damage and minor damage leads to the misclassification of these two classes. Moreover, as shown in
Figure 8a,b, the imaging angles of the pre-disaster and post-disaster images are different, which leads to an incomplete overlap of building locations in the two images.
Figure 9 illustrates the performance on the detection of undamaged and major damaged buildings. It can be seen that the Transformer-based models perform well in detecting undamaged buildings. However, due to the complex damage distribution of the local buildings, wrong judgments are made for the major damaged buildings. The baseline method achieves a lower prediction accuracy, where some undamaged buildings are incorrectly classified as major damaged buildings. In comparison with the result of the SDAFormer without the SF module, the SDAFormer with SF module outputs more accurate results on the major damage classification.
Figure 10 illustrates the performance on the detection of destroyed buildings. The assessment result shows that the output of all models is generally correct in terms of building localization. As for the damage assessment, the baseline output in
Figure 10d has a few classification errors in the lower center of the image, and the quality of the assessment for small building objects is not satisfactory. The output of SDAFormer can indicate the damage degree of the disaster-affected area, but the results differ in the details. For example, the building group in the center of
Figure 10e is regarded as major damage by SDAFormer without SF. In
Figure 10f, the building group is regarded as undamaged except for the lower right corner. Comparing the pre-disaster and post-disaster buildings in
Figure 10a,b, it can be recognized that the buildings were displaced considerably by the tsunami, but their structures are still largely preserved. However, such buildings are labeled as undamaged in the ground truth masks in xBD. The model with SF captures the structural connections of the displaced houses and performs better at this position.
3.7. Robustness and Transferability
Due to the difficulty in obtaining a building damage assessment dataset other than the xBD dataset, we selected four independent disaster events outside the xBD dataset to verify the transferability and robustness of the proposed method. The details of the events are listed in
Table 3. For each instance, the pre-disaster and post-disaster images are fed into our framework. The results are shown in
Figure 11.
In the cases of the tornadoes, the buildings are less dense. Because the buildings are seriously damaged, most of them are labeled as destroyed by our model, which is consistent with the actual damage situation. RescueNet and Mask R-CNN cannot accurately outline the buildings and therefore fail to predict the damage levels.
In the cases of Typhoon Yutu, the results of our model show that the majority of the buildings in the image are well detected and correctly located. Some multi-story buildings cast large shadows on the upper right side, which limits the ability to locate buildings and causes some errors around them. In the damage assessment stage, most of the single-story houses are assigned reasonable damage levels, ranging from no damage to destroyed. However, the majority of the multi-story buildings on the right of the image are recognized as destroyed or major damage. It can be seen from the post-disaster image that the texture features of some of the roofs have changed. For some buildings detected as destroyed, the top floors are damaged but the structure of the building remains. Moreover, the off-nadir viewing angle, whose effect is common for tall buildings, disturbs the assessment. On the whole, the predicted results of our method are more precise than those of the compared methods.
4. Discussion
4.1. Findings and Implications
Pairs of pre-disaster and post-disaster satellite images can reflect the building damage level in the disaster-affected areas in a timely and accurate manner. Therefore, we construct the two-stage SDAFormer framework. Meanwhile, Swin Transformer is introduced to form the framework. According to the analysis of experimental results, our framework has a higher overall score than existing CNN-based methods, which proves the effectiveness of our method. The proposed two-stage framework can consider the temporal and spatial relevance between pre- and post-disaster remote sensing images, which helps to improve the building segmentation and damage assessment.
The application of different types of Transformers in the visual field has been a hot research topic in recent years, which can improve the scalability and performance of many tasks. Transformer can correlate key features in different channels and improve the ability to model the global relationships of the framework in building damage assessment. In our study, Swin Transformer is applied in the encoder, decoder, and bottleneck, which shows the universality of the Swin Transformer block in the building damage assessment field.
In our study, a spatial attention mechanism, the spatial fusion module, is also introduced. With this module, the score of building localization remains essentially unchanged, while the accuracy of damage classification is improved, especially for the minor and major damage classes. The spatial fusion module thus has little impact on the semantic segmentation performance of the model, but it can use the surrounding texture of the buildings to support damage-level inference. The spatial fusion module is able to enhance the spatial context feature representation and compensate for the limitation of the window mechanism in Swin Transformer. As a result, the ambiguity caused by blurred post-disaster buildings can be alleviated, improving the ability to distinguish between the minor damage and major damage levels.
According to
Table 4, Weber’s method (Mask R-CNN) has much lower building localization and damage classification F1 scores
than SDAFormer. Unlike our proposed framework, in Weber’s method, a single-branch decoder is applied after concatenating the extracted pre-disaster and post-disaster features. The single-branch decoder results in the inability to utilize the relevant features of the pre-disaster and post-disaster buildings. Therefore, buildings of different sizes are unable to be located properly, which affects the damage assessment result. Meanwhile, the CNN structure limits the global feature extraction capability and the overall understanding of the image. Regarding RescueNet, both pre-disaster and post-disaster images are simultaneously used for building segmentation, but only post-disaster images are used in the process of damage assessment. This strategy can provide additional information at different points in time to the segmentation task. However, since the structure of the buildings is damaged or destroyed by the disaster, the additional semantic information of the post-disaster images can lead to confusion of the network. Meanwhile, the damage classification task lacks the guidance on the localization of the original building, which weakens the ability to detect destroyed buildings because of the seriously damaged appearance.
4.2. Limitations
Although SDAFormer has shown superior performance in the experiments, it still has two limitations.
First, with the process of urbanization, it is important to assess urban disaster-affected areas. However, it is still difficult to comprehensively assess the damage of the multi-story buildings for our proposed framework. The information of a building from the satellite image is limited to roofs, while damages on walls cannot be effectively detected. In some cases, these tall buildings are projected into irregular shapes due to the satellite imaging angle, and the sides of the buildings are shown in the satellite images, which may lead to ambiguity of the network. In addition, the shadows of the buildings may change with the satellite acquisition time. However, in the xBD dataset, most of the building objects are low buildings, which weakens the model’s ability to detect tall buildings.
Second, the training samples are imbalanced in terms of damage levels and the disaster categories, as shown in
Table 2 and
Figure 12. The imbalanced training set may affect the training process of the network. Due to the difficulty in acquiring paired high-resolution pre-disaster and post-disaster remote sensing images, the xBD dataset is the first building damage assessment dataset with high-quality annotations. Therefore, it requires a large amount of work to expand the size of the building damage assessment dataset.
4.3. Future Work
Based on the advantages and disadvantages of SDAFormer analyzed above, two possible directions for future research are discussed in this section.
First, in future experiments, additional relevant datasets from other remote sensing sensors, including an Unmanned Aerial Vehicle (UAV) system, will be used to train the network. The additional data allow our model to learn more key features and verify the universal applicability of SDAFormer in building damage assessment.
Second, on a larger scale, the patterns of damaged buildings are more diverse and complex than the existing classification criteria, Joint Damage Scale (JDS) [
48], especially for buildings in urban areas. The process of label calibration is subjective, and the resulting labeling bias can affect the training process. Therefore, more detailed categories of damaged buildings can help the model extract more useful information to support HADR. We will further explore the classification method for building damage assessment and the application of the Transformer architecture in the field of disaster response.