Deep Learning-Based Automated Background Removal for Structural Exterior Image Stitching

Abstract: This paper presents a deep learning-based automated background removal technique for structural exterior image stitching. To establish an exterior damage map of a structure using an unmanned aerial vehicle (UAV), close-up vision scanning is typically required. However, unwanted background objects are often captured within the scanned digital images. Since these background objects often cause serious distortion in the image stitching process, they should be removed. In this paper, an automated background removal technique using deep learning-based depth estimation is proposed. Because the region of interest has a shorter working distance from the camera than the background, the background region within the digital images can be automatically removed using a deep learning-based depth estimation network. In addition, an optimal digital image selection method based on a feature matching-based overlap ratio is proposed. The proposed technique is experimentally validated using UAV-scanned digital images acquired from an in-situ high-rise building structure. The validation test results show that the optimal digital images obtained from the proposed technique produce a precise structural exterior map with a computational cost reduction of 85.7%, while the raw scanned digital images fail to construct the structural exterior map and cause serious stitching distortion.


Introduction
Monitoring the integrity of aging structures has become increasingly important for extending their service life and reducing maintenance costs. For effective monitoring of large-scale structures, unmanned aerial vehicles (UAVs) have recently played a key role, because they enable faster and safer inspection than expert-dependent visual inspection, even in areas inaccessible to humans [1][2][3]. One of the most popular UAV-based inspection strategies is to assess structural exterior damage from UAV-captured digital images. However, assessing damage and making decisions from a large number of digital images is often labor-intensive and unreliable, especially as the target structure gets larger. In particular, damage quantification and localization are challenging without a structural exterior map. To tackle these issues, digital image stitching techniques have been widely adopted for mapping the entire structural exterior [4][5][6][7]. Precise structural damage quantification and localization, including micro-scale damage, often requires close-up, high-resolution spatial scanning by a digital camera-mounted UAV along the entire structural region of interest (ROI) [8][9][10][11]. To construct structural exterior maps from the digital images scanned along a large-scale structure, optimal digital images should be selected by considering the overlap ratio between adjacent images to be stitched. Using all raw digital images is ineffective in terms of both computational cost and image stitching accuracy. Several techniques have been investigated to address the optimal digital image selection issue. Yang et al. [12] used a constant time interval technique, which extracts video frames every two seconds. This technique reduces the spatial redundancy of the acquired video frames, but it cannot guarantee a constant overlap ratio between the selected digital images. To ensure a constant overlap ratio, Bang et al. [13] proposed a key frame selection technique that relies on the known operational condition of the UAV. Similarly, Bu et al. [14] employed monocular simultaneous localization and mapping (SLAM) to stitch UAV-scanned images in real time. They calculated the relative distance among adjacent images through a weighted combination of translation and rotation in large-scale direct SLAM, and then selected the key frames using a certain threshold.
However, in the UAV's close-up vision scanning, digital images often include both the target ROI and unnecessary background, especially at the edge of a target structure. Background objects such as sky, mountains, rivers and trees disturb both stitching and optimal image selection, because their features vary very differently from those of the target structural ROI across the sequentially scanned images. In addition, the background objects contain more distinguishable image features than the repetitive and local features of the target ROI, which causes serious distortion and ghosting effects in the image stitching process. A number of approaches have been tried to solve this problem. For example, Xin et al. [15] proposed a self-adaptive optical flow technique to detect target object regions in sequential image data. They enhanced the object outlines from a rough optical flow field using a local mean algorithm, and the target object regions were then extracted. In addition, Supreeth and Patil [16] studied a multiple moving object tracking technique that achieves robustness against object occlusion, shadows and camera jitter by combining background subtraction and k-means clustering. More recently, Fang et al. [17] proposed a deep learning network, called Tiramisu and trained with the Common Objects in Context (COCO) dataset, which segments target objects for background removal. Although the aforementioned background removal techniques can be effective for digital images obtained under constant camera pose, scanning speed and path conditions, the UAV's close-up scanning condition, especially for outdoor buildings, can be sensitively altered by the surrounding environment as well as the operator's skill. Moreover, the performance of the conventional background removal techniques depends heavily on the level of image blur and noise, which are unfortunately inevitable in sequential images captured under continuous spatial scanning.
In this paper, a deep learning-based automated background removal technique suited to the UAV's close-up scanning condition is proposed. The proposed technique has the following advantages over the existing techniques: (1) the digital image acquisition conditions are not strictly restricted; (2) it is robust against image blur and noise; and (3) the computational cost can be reduced through optimal image selection using image feature matching-based overlap ratio calculation. The proposed technique is experimentally validated using UAV-scanned digital images acquired from an in-situ high-rise building structure. This paper is organized as follows. First, the deep learning-based automated background removal technique, including an optimal digital image selection algorithm, is developed in Section 2. Then, the experimental validation results are presented in Section 3. Finally, the paper concludes with a brief discussion.
Figure 1 shows the overview of structural exterior map establishment through deep learning-based automated background removal and optimal image selection. Once the vision camera-mounted UAV scans the target structure at a short working distance along a predefined scanning path, spatially continuous digital images are acquired for high-resolution structural exterior map establishment. To properly stitch the scanned digital images, ROI images containing only the target structure need to be extracted from the field of view (FOV) images. Since the FOV images, especially those obtained from the edge of the target structure as shown in Figure 1, inevitably contain background objects as well as the ROI, the background regions are removed using a deep learning-based depth estimation network. Subsequently, the optimal images that minimize stitching errors as well as computational costs are selected from the entire set of video frames based on overlap ratio calculation. Finally, the structural exterior map is constructed using a mesh-based image stitching method, as shown in Figure 1. The details of each procedure are as follows.

Structural Exterior Image Stitching through Automated Background Removal
Appl. Sci. 2021, 11, x FOR PEER REVIEW 3 of 12

Figure 1. Overview of structural exterior map establishment through deep learning-based automated background removal and optimal image selection: FOV and ROI denote the field of view and region of interest, respectively.
Figure 2 shows the deep learning-based automated background removal process. This step rests on the assumption that the structural ROI, which is obtained through the UAV's close-up scanning, is much closer to the UAV-embedded digital camera than the background objects within the FOV. Thus, a deep learning-based depth estimation network called Monodepth2, which is based on a U-net architecture, is employed in this study [18]. Monodepth2 is trained in a self-supervised manner by exploiting spatial geometry constraints. It uses a full-resolution multi-scale sampling method to reduce visual artifacts and an auto-masking loss to ignore training pixels that violate the camera motion assumption. Because this network can rapidly estimate depth values using only monocular RGB images, it is suitable for high-resolution image processing. The effectiveness of Monodepth2 was validated by comparison with 28 other depth estimation models using measurement metrics [18]. The employed model was implemented in PyTorch and trained for 20 epochs using an Adam optimizer with a batch size of 12. A learning rate of 10^−4 was used for the first 15 epochs and then dropped to 10^−5 for the remainder. The smoothness term was set to 0.001. The KITTI dataset was used for pre-training, and 10% of the dataset was used as a validation set.

Automated Background Removal Using Deep Learning-Based Depth Estimation
Once the depth values were estimated for each pixel of the background-including FOV images acquired from the edge of the target structure, the depth image was obtained as shown in Figure 2. Subsequently, the ROI boundary was extracted by using the depth difference. However, precise ROI boundary extraction is often difficult due to undesired noise components on the depth images, so image post-processing was necessary. On the depth images in Figure 2, brighter pixels indicate a shorter working distance from the digital camera. The ROI boundaries within the depth image can be extracted by using the magnitude of the depth gradients ($G$), which is given by:
$$G = \sqrt{\left(\frac{\partial I}{\partial x}\right)^{2} + \left(\frac{\partial I}{\partial y}\right)^{2}} \tag{1}$$
where $I$ is the depth image corresponding to the FOV image, and $\partial I/\partial x$ and $\partial I/\partial y$ are the depth gradients of $I$ along the $x$ and $y$ directions, respectively. To precisely extract the ROI boundaries, the $G$ image was binarized by Otsu's method, so that each pixel had a value of 1 or 0 [19]. Dotted pepper noise was then removed from the boundary image, resulting in a clear ROI boundary consisting of consecutive pixel sets, as depicted in Figure 2. Then, the ROI was extracted by retaining the region with the smaller mean depth value across the ROI boundary, i.e., the region closer to the camera. Finally, the background region was automatically removed by overlaying the extracted ROI region on the FOV image.
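The boundary extraction steps described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: it assumes the input array holds depth values where smaller means closer to the camera, that a single roughly vertical boundary separates the ROI from the background (as in the edge scan considered here), and it replaces the pepper-noise removal with a per-row median of the boundary columns. The function names are hypothetical.

```python
import numpy as np

def gradient_magnitude(depth):
    """Equation (1): G = sqrt((dI/dx)^2 + (dI/dy)^2), using finite differences."""
    gy, gx = np.gradient(depth.astype(float))  # axis 0 is y, axis 1 is x
    return np.hypot(gx, gy)

def otsu_threshold(values, bins=256):
    """Otsu's method: pick the threshold that maximises between-class variance."""
    hist, edges = np.histogram(values, bins=bins)
    centers = 0.5 * (edges[:-1] + edges[1:])
    w0 = np.cumsum(hist).astype(float)   # number of pixels at or below each bin
    w1 = w0[-1] - w0                     # number of pixels above each bin
    s0 = np.cumsum(hist * centers)
    m0 = s0 / np.maximum(w0, 1e-12)
    m1 = (s0[-1] - s0) / np.maximum(w1, 1e-12)
    return centers[np.argmax(w0 * w1 * (m0 - m1) ** 2)]

def roi_mask_from_depth(depth):
    """Return a boolean mask of the side of the boundary that is closer to the camera."""
    g = gradient_magnitude(depth)
    boundary = g > otsu_threshold(g.ravel())
    if not boundary.any():               # no depth edge found: keep the whole image
        return np.ones(depth.shape, dtype=bool)
    cols = np.arange(depth.shape[1])
    fallback = int(np.median(np.nonzero(boundary)[1]))
    # Assumption: one boundary column per row; the median suppresses isolated noise.
    split = np.array([int(np.median(cols[row])) if row.any() else fallback
                      for row in boundary])
    left = cols[None, :] <= split[:, None]
    if left.all() or not left.any():
        return np.ones(depth.shape, dtype=bool)
    # Retain the region with the smaller mean depth (the structural ROI).
    return left if depth[left].mean() <= depth[~left].mean() else ~left

def remove_background(image, depth):
    """Zero out background pixels of a grayscale or colour image."""
    mask = roi_mask_from_depth(depth)
    return image * (mask if image.ndim == 2 else mask[..., None]), mask
```

In practice the depth array would come from the depth estimation network; if the network outputs a disparity-like map in which brighter means closer, as in Figure 2, the comparison in `roi_mask_from_depth` would need to be inverted.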

Optimal Image Selection for Cost-Effective Digital Image Stitching
To construct the structural exterior map, a number of spatially continuous digital images, often in the form of video frames, should be acquired, because the FOV is often much smaller than the entire ROI of a large-scale structure. Moreover, the image resolution should be high enough to inspect micro-scale damage on the target structure. For these reasons, structural exterior map construction using all scanned digital images typically requires tremendous computational cost. Furthermore, image stitching errors often increase when an excessive number of scanned digital images is used. To address these technical issues, optimal image or frame selection is necessary. In this step, an optimal image selection algorithm using image feature matching-based overlap ratio calculation is proposed. Figure 3 describes the flow chart of the overlap ratio-based optimal image selection algorithm. First, image features such as points, corners or edges were extracted from every ROI image ($I_1, I_2, I_3, \ldots, I_n$) using the scale-invariant feature transform (SIFT) [20]. Since SIFT is invariant to image translation, scaling and rotation, and partially invariant to illumination changes, it is advantageous for processing the UAV's close-up scanning data. Next, the image features were initially matched between the adjacent images; these matches are defined as the matching vectors. Then, the false matching vectors were removed by using random sample consensus (RANSAC), because similar image features on the repeated texture of the target structure are often mismatched. After RANSAC, the correct matching vectors, called the inlier vectors, were obtained, which physically indicate how far $I_q$ was translated from $I_p$ along the $x$ and $y$ directions.
Based on the assumption that the working distance between the target structure and the digital camera mounted on the UAV does not change, the overlap ratio $R$ between $I_p$ and $I_q$ was calculated by Equation (2) from the inlier vectors' averaged magnitudes along the $x$ and $y$ directions, where Height and Width are the height and width of $I_p$, and $\mu$ and $\nu$ are the inlier vectors' averaged magnitudes along the $x$ and $y$ directions, as shown in Figure 3. Once $R$ was calculated, the optimal image that satisfies the predefined threshold of $R$ ($R_{th}$) was determined.
The above procedure was iterated as shown in Figure 3 until all optimal images were obtained.
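The selection loop can be sketched as follows. Because the overlap formula is not reproduced here, the sketch assumes a pure-translation model in which the overlap is the shared rectangle area divided by the image area, (Width − µ)(Height − ν) / (Width × Height); this is an assumption, not necessarily the exact form of Equation (2). The callback `inlier_vectors_between` is a hypothetical stand-in for SIFT matching followed by RANSAC, and frame indices are zero-based.

```python
import numpy as np

def overlap_ratio(inlier_vectors, width, height):
    """Assumed translation-only overlap between two equally sized images."""
    v = np.abs(np.asarray(inlier_vectors, dtype=float))
    mu, nu = v[:, 0].mean(), v[:, 1].mean()   # averaged |dx| and |dy| of the inliers
    return max(width - mu, 0.0) * max(height - nu, 0.0) / (width * height)

def select_optimal_images(n, inlier_vectors_between, width, height, r_th=0.8):
    """Greedy selection: from frame p, keep the farthest frame q with R >= r_th."""
    selected = [0]
    p = 0
    while p < n - 1:
        q = p + 1                              # always advance by at least one frame
        while (q + 1 < n and
               overlap_ratio(inlier_vectors_between(p, q + 1), width, height) >= r_th):
            q += 1
        selected.append(q)
        p = q
    return selected
```

With a threshold of 0.8, a sequence whose overlap with the first frame falls below the threshold at the fifth frame yields the fourth frame as the next selected image, after which the search restarts from that frame.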
Figure 3. Flow chart of the overlap ratio-based optimal image selection algorithm: $R$ is the overlap ratio, $p$ and $q$ are the iteration variables, $n$ is the total number of the FOV images, and $I_p$ and $I_q$ are the background-removed ROI images.

Mesh-Based Digital Image Stitching for Structural Exterior Map Establishment
To establish a precise structural exterior map, a local warp with a grid mesh is often used. In this study, a mesh-based digital image stitching method called natural image stitching with the global similarity prior (NISwGSP) was employed [21]. Once the optimal images ($I_j$ and $I_{j+1}$) were selected in Step 2, the homography matrix $H$ ($H \in \mathbb{R}^{3 \times 3}$), which is reshaped from $\hat{h}$, was estimated between each pair of optimal images using Equation (3), where $[x_i \; y_i]$ and $[x_i' \; y_i']$ are the matched feature points of $I_j$ and $I_{j+1}$, respectively, $N$ is the number of matched feature points, and $h_m$ denotes the $m$-th row components of $H$. The elements of $H$ were obtained using the direct linear transformation (DLT), and the solution is the least significant right singular vector of $\{a_i\}_{i=1}^{N}$. Given $H$, an arbitrary pixel position $X_*$ on $I_j$ was warped to the pixel position $X_*'$ on $I_{j+1}$ using the location-dependent homography matrix $H_*$ [22], as expressed in Equation (4),
where $\tilde{X}_*$ and $\tilde{X}_*'$ are the homogeneous coordinates of $X_*$ and $X_*'$, $\omega_*^i$ is the scalar weight, $\rho$ is the scale parameter, and $X_i$ is the position of the extracted feature point in $I_j$. Similarly, Equation (4) was solved by using DLT. However, solving Equation (4) for every pixel position $X_*$ on $I_j$ is not computationally efficient, because neighboring pixel positions often produce the same $H_*$. Thus, $I_j$ and $I_{j+1}$ were divided into a mesh composed of $C_x \times C_y$ cells [23]. For each cell, the center point was chosen as $X_*$, and all pixels within the cell were warped by the same $H_*$, as shown in Figure 4. Finally, the structural exterior map was constructed through mesh optimization and image mapping [21].
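As a rough illustration of the DLT step, the sketch below estimates a homography from matched points by taking the right singular vector associated with the smallest singular value, and then repeats the estimate with per-point weights for a cell center. It follows the standard DLT formulation; the Gaussian weight exp(−‖X* − X_i‖² / ρ²) is an assumed form of the scalar weight, and the function names are hypothetical. It is not the NISwGSP implementation and omits coordinate normalization and mesh optimization.

```python
import numpy as np

def dlt_rows(src, dst):
    """Two linear constraints on the 9 entries of H per match (x, y) -> (u, v)."""
    rows = []
    for (x, y), (u, v) in zip(src, dst):
        rows.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
        rows.append([x, y, 1, 0, 0, 0, -u * x, -u * y, -u])
    return np.asarray(rows, dtype=float)

def estimate_homography(src, dst, weights=None):
    """DLT: h is the right singular vector with the smallest singular value."""
    a = dlt_rows(src, dst)
    if weights is not None:
        # Each match contributes two rows, so repeat every weight twice.
        a = a * np.repeat(np.asarray(weights, dtype=float), 2)[:, None]
    _, _, vt = np.linalg.svd(a)
    h = vt[-1].reshape(3, 3)
    return h / h[2, 2]                    # fix the scale (assumes h[2, 2] != 0)

def local_homography(src, dst, center, rho):
    """Location-dependent estimate: matches near `center` get larger weights."""
    d2 = np.sum((np.asarray(src, float) - np.asarray(center, float)) ** 2, axis=1)
    return estimate_homography(src, dst, np.exp(-d2 / rho ** 2))

def warp_points(h, pts):
    """Apply a homography to an (N, 2) array of pixel positions."""
    p = np.c_[np.asarray(pts, float), np.ones(len(pts))] @ h.T
    return p[:, :2] / p[:, 2:3]
```

For a mesh of cells, `local_homography` would be called once per cell center and the result applied to all pixels in that cell. With a small `rho`, distant weights can underflow toward zero and make the system ill-conditioned, so a lower bound on the weights may be needed in practice.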

Experimental Validation
The proposed technique was experimentally validated using a vision camera-mounted UAV at an in-situ 18-story building. The overall test procedure was as follows. First, the vision camera-mounted UAV scanned the target structure along a predefined scanning path to acquire spatially continuous FOV images. Then, the automated background removal and the optimal image selection procedures were conducted sequentially. Finally, the structural exterior map was constructed using the mesh-based image stitching method. To show the advantage of the proposed technique, the test results were compared with those obtained from the raw digital images.
Figure 5a,b show the target building and the UAV (Mavic 2, DJI) with its mounted digital camera (L1D-20c, Hasselblad) used in this study. To evaluate the feasibility of the proposed technique, the UAV scanned along the right-hand-side edge of the target structure while keeping a working distance of 4 m between the target ROI and the UAV. The digital images were obtained in video format with a resolution of 3840 × 2160 pixels at 30 frames per second.
Figure 6 shows representative background removal results. The depth image of Figure 6b was obtained by applying Monodepth2 to the raw FOV image of Figure 6a. Although the depth image contained undesired noise components caused by scene complexity, such as windows, the target structure's texture and background objects, the ROI boundary between the target structure and the background objects was successfully extracted using Equation (1), as shown in Figure 6c. Then, the background region on the right side of the edge boundary in Figure 6c was removed, and only the ROI remained, as shown in Figure 6d. To quantitatively evaluate the accuracy of the depth image-based background removal results, a pixel-level error ratio between the resultant ROI image and its ground truth, shown in Figure 6e, was calculated.
Figure 7 shows the error ratio results obtained from all the background-removed ROI images. The error ratio averaged over all ROI images was around 0.75%, which indicates that the proposed algorithm achieved over 99% accuracy for background removal and is acceptable for the subsequent structural exterior map construction.
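The exact definition of the pixel-level error ratio is not given here. One plausible reading, shown below as an assumption, is the fraction of pixels whose ROI/background label differs from the ground-truth mask; `pixel_error_ratio` is a hypothetical helper name.

```python
import numpy as np

def pixel_error_ratio(roi_mask, truth_mask):
    """Fraction of pixels labelled differently from the ground truth (assumed metric)."""
    roi = np.asarray(roi_mask, dtype=bool)
    truth = np.asarray(truth_mask, dtype=bool)
    if roi.shape != truth.shape:
        raise ValueError("masks must have the same shape")
    return float(np.mean(roi != truth))
```

Under this reading, an error ratio of 0.75% corresponds to 99.25% of pixels being labelled correctly.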
Next, representative overlap ratio calculation results are shown in Figure 8. In Figure 8a, 14,421 and 13,850 image features were extracted from $I_1$ and $I_7$, respectively. The extracted features were initially matched as displayed in Figure 8a, and only the inlier vectors then remained, as shown in Figure 8b. Subsequently, an $R$ of 66.49% was calculated using Equation (2), as shown in Figure 8b.
To properly stitch digital images, it is generally recommended that $R$ be greater than 50% [24]. In this study, an $R_{th}$ of 80% was used because local image features such as windows and the target structure's texture were highly repetitive in the sequential image data and may increase false feature matching. As shown in Figure 9, $R$ decreased steadily as $q$ increased for $p = 1$, reaching 82.9% and 77.2% at $q = 4$ and $q = 5$, respectively. Therefore, $I_4$ was selected as the optimal image with respect to $p = 1$. Once $I_4$ was selected, the next optimal image was selected in the same way starting from $I_4$. This procedure was repeated until all optimal images were determined for structural exterior map establishment.
Figure 10 shows the raw and selected optimal images. In total, 118 raw FOV images, which were acquired from the UAV's close-up scanning, contained various background objects, as shown in Figure 10a. On the other hand, only 35 background-removed ROI images were extracted through the automated background removal and optimal image selection algorithms, as shown in Figure 10b. To compare these two image sets, the structural exterior maps were constructed from each of them.
Figure 11 compares the structural exterior map establishment results obtained with the raw and optimal images. When the raw images were used, the structural exterior map could not be constructed due to serious distortion and ghosting effects, as shown in Figure 11a. The images were stitched according to the background features, because the background objects had more distinguishable image features than the ROI. On the other hand, the structural exterior map using the optimal images was properly constructed without distortion and ghosting effects on the ROI, as shown in Figure 11b.
On the other hand, the structural exterior map using the optimal images was properly constructed without distortion and ghosting effects on the ROI, as shown in Figure 11b
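The optimal image selection procedure described earlier in this section can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `select_optimal_images`, the `overlap` callback (standing in for the feature matching-based overlap ratio of Equation (2)), and the fallback to the adjacent image when no candidate reaches the threshold are all assumptions made for illustration. The sketch also assumes, as observed in Figure 9, that the overlap ratio decreases as the candidate index grows.

```python
from typing import Callable, List


def select_optimal_images(
    num_images: int,
    overlap: Callable[[int, int], float],
    r_th: float = 0.80,
) -> List[int]:
    """Greedily pick image indices (1-based) for stitching.

    Starting from image p, the candidate index q is advanced while the
    overlap ratio between images p and q + 1 stays at or above r_th.
    The last q that still satisfies the threshold becomes the next p.
    `overlap` is a hypothetical stand-in for the paper's Equation (2).
    """
    selected = [1]
    p = 1
    while p < num_images:
        # Assumption: the adjacent image is always accepted so that the
        # scan keeps progressing even if its overlap is below r_th.
        q = p + 1
        while q + 1 <= num_images and overlap(p, q + 1) >= r_th:
            q += 1
        selected.append(q)
        p = q
    return selected


if __name__ == "__main__":
    # Toy overlap model (not measured data): 6% loss per skipped frame.
    toy_overlap = lambda p, q: 1.0 - 0.06 * (q - p)
    print(select_optimal_images(10, toy_overlap, r_th=0.80))
```

With the toy overlap model, the overlap with the first image is 82% at $q = 4$ and 76% at $q = 5$, so the fourth image is chosen first, which mirrors the selection of $I_4$ reported above.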

Discussion
Since this work did not train a new deep learning network, but instead employed a suitable pre-trained and pre-validated deep learning model for cost-effective and automated background removal, the suitability of the employed Monodepth2 was additionally tested. To show the effectiveness of the depth estimation results, the relative mean depth values of the ROI and background regions within the FOV images were calculated, as shown in Figure 12. Higher depth values indicate regions closer to the digital camera mounted on the UAV. The ROIs' relative mean depth values were consistently higher than the background ones, with no overlap between them. This indicates that the relative depth values were successfully estimated using Monodepth2 and that the corresponding ROIs were properly extracted. Monodepth2 works well here because this work assumes that the structural ROI is much closer to the UAV than the background objects. However, the depth estimation errors may increase when the working distances of the ROI and the background objects are similar.
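The separation check behind Figure 12 can be sketched as follows. This is an illustrative example only, assuming that a relative depth map and a binary ROI mask are already available for each FOV image as nested lists; the helper names `region_means` and `is_separable` are hypothetical and do not come from the paper or from Monodepth2. Following the convention stated above, a higher value means a region closer to the camera.

```python
from statistics import mean
from typing import List, Sequence, Tuple


def region_means(
    depth: Sequence[Sequence[float]],
    roi_mask: Sequence[Sequence[bool]],
) -> Tuple[float, float]:
    """Return (ROI mean, background mean) of a relative depth map.

    Both regions must contain at least one pixel; otherwise
    statistics.mean raises StatisticsError.
    """
    roi, background = [], []
    for depth_row, mask_row in zip(depth, roi_mask):
        for value, is_roi in zip(depth_row, mask_row):
            (roi if is_roi else background).append(value)
    return mean(roi), mean(background)


def is_separable(pairs: List[Tuple[float, float]]) -> bool:
    """True if every ROI mean exceeds every background mean,
    i.e., the two groups do not overlap across the image set."""
    lowest_roi = min(roi for roi, _ in pairs)
    highest_background = max(bg for _, bg in pairs)
    return lowest_roi > highest_background


if __name__ == "__main__":
    # Toy 2 x 2 depth map (not measured data): top row is the ROI.
    depth = [[0.9, 0.8], [0.2, 0.1]]
    mask = [[True, True], [False, False]]
    print(region_means(depth, mask))
```

A result of `True` from `is_separable` over all images corresponds to the non-overlapping ROI and background values described above; a `False` result would flag the similar-working-distance case in which depth-based removal is expected to be less reliable.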
In addition, since computational cost is one of the critical obstacles to constructing a large-scale structural exterior map, the computational time was compared between the raw image and optimal image cases in Table 1. The structural exterior map establishment took 9 h 13 min 47 s using the 118 raw images, whereas it was completed within 1 h 18 min 57 s using the 35 optimal images. Note that the computational times were measured on an Intel® Xeon® E5-2630 v4 CPU with 64 GB of memory. The optimal image case includes the optimal image selection time of 40 min 49 s, which is about 51.7% of the total 1 h 18 min 57 s. These results indicate that the proposed technique is critical for structural exterior map construction performance and can also reduce the computational cost by approximately 85.7%.
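The reported percentages can be reproduced directly from the times stated above. The short sketch below only restates that arithmetic; the helper name `to_seconds` is introduced here for illustration.

```python
def to_seconds(hours: int, minutes: int, seconds: int) -> int:
    """Convert an h/min/s duration to seconds."""
    return hours * 3600 + minutes * 60 + seconds


raw_s = to_seconds(9, 13, 47)        # 118 raw images
optimal_s = to_seconds(1, 18, 57)    # 35 optimal images, selection included
selection_s = to_seconds(0, 40, 49)  # optimal image selection only

# Relative reduction in total time versus the raw image case.
reduction = 1 - optimal_s / raw_s
# Share of the optimal-case time spent on image selection.
selection_share = selection_s / optimal_s

print(f"Cost reduction:  {reduction * 100:.1f}%")        # 85.7%
print(f"Selection share: {selection_share * 100:.1f}%")  # 51.7%
```

In seconds, the two cases take 33,227 s and 4,737 s, respectively, so the optimal image case requires roughly one seventh of the raw image time even though about half of it is spent on image selection.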


Conclusions
This paper proposed a deep learning-based automated background removal technique for structural exterior image stitching. The effectiveness of the proposed technique was experimentally demonstrated through in-situ tests on a high-rise building structure using a camera-mounted unmanned aerial vehicle (UAV). The test results were then compared with the structural exterior map constructed using untreated raw images. The validation test results showed that the structural exterior map obtained using the proposed technique was properly constructed without distortion and ghosting effects. On the other hand, the structural exterior map constructed from raw images without any image processing showed serious distortion and ghosting effects on the region of interest. Furthermore, the proposed technique constructed the precise structural exterior map with a computational cost reduction of 85.7% versus the raw image case. Although the proposed technique depends strongly on the accuracy of depth estimation, it can be one of the promising tools for automatically establishing structural exterior maps from a UAV's close-up scanned images with low computational cost. As a follow-up study, an advanced image stitching algorithm that is robust against variations in the test environment is now being developed. Furthermore, a deep learning-based automated structural damage detection algorithm incorporating the precise structural exterior map will be developed to extend the applicability of the proposed technique.
