Three-Dimensional Convolutional Neural Network on Multi-Temporal Synthetic Aperture Radar Images for Urban Flood Potential Mapping in Jakarta

Abstract: Flooding in urban areas is counted as a significant disaster that must be correctly mitigated due to the large number of affected people, material losses, hampered economic activity, and flood-related diseases. One of the technologies available for disaster mitigation and prevention is satellites providing image data on previously flooded areas. In most cases, floods occur in conjunction with heavy rain; thus, in a satellite's optical imagery, the flood area is mostly covered with clouds, making observation ineffective. One solution to this problem is to use Synthetic Aperture Radar (SAR) sensors and observe backscatter differences before and after flood events. This research proposes mapping flood-prone areas using machine learning, classifying the areas with the 3D CNN method. The method was applied to a combination of co-polarized and cross-polarized multi-temporal SAR image datasets covering Jakarta City and the coastal area of Bekasi Regency. Testing with multiple combinations of training/testing split proportions and numbers of epochs gave the optimum performance at an 80/20 split with 150 epochs, achieving an overall accuracy of 0.719 after training for 283 min. Testing with 140, 145, 155, and 160 epochs gave lower results.


Introduction
Flooding is one of the most detrimental disasters, especially in cities such as Jakarta, because it affects a large number of residents through material losses from properties damaged by flood inundation and through diseases caused by degraded sanitation in the flooded area. A major flood in Jakarta results in 8.7 trillion IDR, or 625 million USD, of losses and recovery efforts [1]. At present, most flood mapping in Indonesia has not fully utilized satellite spatial data because it still relies on data reported by the local government in the form of numerical data [2]. The visualization of the flood map is based on tabulated data placed on the area map and does not represent the actual conditions, resulting in a discrepancy between the reported flood area and the actual area. This discrepancy affects flood handling, such as the calculation of damage impact and the number of residents affected by the flood, and leads to inefficient distribution of aid. Problems that arise due to limited spatial information on floods can be solved by using multi-sensor remote sensing satellite data. Many technologies have been developed to predict, prevent, and mitigate flood disasters more accurately, including remote sensing using images obtained from airborne and spaceborne platforms [3][4][5]. The earliest and most common form of remote sensing is optical photography, with overhead images providing information on the affected area.
This study proposes a three-dimensional CNN for mapping flood areas, filtering and weighting neurons to map the potential flood areas in urban areas with better accuracy and a smaller number of images.
Yu Li et al. proposed an Active Self-Learning method on CNN to detect floods in urban areas from a SAR image ensemble [37]. The dataset used is four TerraSAR-X images of HH polarization, composed of one pre-event image, one co-event image, and two post-event images. Linyi Li et al. proposed a high-resolution urban flood mapping method (Super-Resolution Mapping of Urban Flood, SMUF) that fuses a Support Vector Machine and a General Regression Neural Network (FSVM-GRNN) [35]. Because the urban flood area in the observation area is not very dense, the accuracy of FSVM-GRNN is 80.2%.
Shen et al. proposed a machine learning process to correct the mapping of flood inundation areas in near-real-time (NRT) using SAR, where the observation area is an open area without many obstacles on the surface [41]. During segmentation, it is difficult to separate flooded areas from areas whose surface reflection properties are similar to those of a water surface. ML is performed to correct speckle noise and other scattering artifacts that can interfere with data reading and classification. Filtering is used in most SAR image processing, but it reduces the effective resolution, changes the signal statistics, and cannot completely remove the noise. To overcome this, Shen et al. used a Logistic Binary Classifier (LBC) in a correction step, trained to detect the presence of water in the pixels within water bodies and their surrounding buffer areas.
The objective of this work is to investigate the mapping of flood potential in Jakarta and nearby coastal areas using a three-dimensional CNN on co-polarized (VV) and cross-polarized (VH) Sentinel-1a SAR images. A three-dimensional classification combines the two-dimensional image and the one-dimensional multi-temporal process into a single convolution. The images are pre-processed into grayscale images and converted into a vector data format. The 2 January 2020 images were also sampled as flood and non-flood target sub-images, together with the corresponding locations from the other images, to capture the multi-temporal value changes of the flooded locations and the consistency of the non-flooded locations. The CNN training is performed with training/test splits of 70/30, 80/20, and 90/10, with the number of epochs varying between 100 and 160 iterations, to obtain the combination with the highest accuracy and the shortest processing time.

Location and Data
A radar image is generated from the reflection of active microwaves emitted from the radar vehicle (airplane or satellite). The transmitter in a radar system is an antenna array consisting of a waveguide that emits a very narrow beam of microwaves. The radar sensor moves along a trajectory, and the area highlighted by the radar (known as the footprint) moves along the surface being swept to form an image. A digital radar image consists of many pixels, each representing the backscatter of a point on the surface. Figure 1 shows an example of a SAR image from 2 January 2020 that is free from cloud cover, with bright dots showing high backscattering and dark dots representing low backscattering, while Figure 2 is the optical image from the same date showing cloud cover.
The radar system generally has a wavelength undisturbed by interference from water particles and water vapor in the air (clouds and rain). Because it does not depend on illumination (irradiation) from the sun or other sources, a radar system can function day and night and in all weather. Synthetic Aperture Radar (SAR) works by detecting the phase change of the reflected signal caused by the movement of the platform to obtain a surface image with good resolution (i.e., visually discernible).
SAR systems are generally divided into two wavelength groups, namely short (C-band and X-band) and long (L-band and P-band) waves. Early SAR satellite systems used a single platform, such as Radarsat. Currently, the most commonly used are satellite constellation systems such as the TerraSAR-X and TanDEM-X pair, the four-satellite X-band Cosmo-SkyMed, the three-satellite Radarsat Constellation Mission, and the Sentinel-1 satellite pair, which give shorter revisit times and higher temporal resolution [20,41-43].
Figure 3 shows the backscatter mechanisms of shortwave radar (illustrated with black arrows) and longwave radar (illustrated with blue arrows) on various surfaces under normal conditions and during a flood. On a grass surface, there are surface reflections at both wavelengths due to the relative roughness of the surface. For short waves, the scattering is due to the thickness of the grass, while long waves can penetrate deeper [44]. When a flood occurs, specular reflection occurs for both types of waves. On objects such as trees or forests, the reflection is dominated by volume scattering. For short waves, the scattering comes from the canopy (leaves) of the trees, while for long waves the scattering by the branches and other tree structures is supplemented by a double bounce, which hits the ground surface and then the tree trunk, or vice versa. When there is a flood, the double reflection becomes more significant due to the specular reflection on the water surface (shown as a thick line in the direction of the reflection). In urban areas, the reflection at both wavelengths is dominated by multiple reflections, although the surface appears coarser at short wavelengths; during a flood, this double reflection is likewise significantly strengthened by the specular reflection on the water surface.
In this study, the flood data came from the Sentinel-1a remote sensing satellite. The data were downloaded through Google Earth Engine from the Copernicus catalog by selecting the available dates from the archive. The selected mode is Interferometric Wide Swath (IW) with a 250 km swath and 5 m × 20 m spatial resolution [45]. For our model, the pixel resolution is preset at 10 m × 10 m. Remote sensing data combined with GIS data are integrated to create a flood hazard and potential map. Based on information obtained from remote sensing and GIS databases, the ML method can be applied for spatial modeling of flood vulnerability.
The data shown in Figure 4 are divided by the date of acquisition into three categories: pre-event data, from November to December 2019, which represent conditions before the major flood occurred; co-event data, taken on or near the 2 January 2020 flood; and the rest, categorized as post-event data. In Figure 5, the SAR images are assembled into a dataset containing 39 cross-polarized and 39 co-polarized images from Sentinel-1a between November 2019 and October 2020, with the co-event images designated as the target image. The SAR dataset was collected using Google Earth Engine and consists of Sentinel-1a VV and VH images as RGB composite TIF images. All images are resized to 946 × 681 pixels, covering the Jakarta area and the flooded parts of the Bekasi and Tangerang Regencies. The target image is further broken down into flood markers to make 25 × 25 pixel kernels. The individual images shown in Figure 5 are combined into three-dimensional data with a 946 × 681 × 78 pixel dataset and a 25 × 25 × 20 pixel kernel.
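The dataset construction described above, stacking co-registered 2D images into a single volume and cutting fixed-size kernels around target pixels, can be sketched in NumPy. This is an illustrative sketch, not the authors' code: the helper names are hypothetical, and the demo dimensions are reduced (the paper's dataset is 946 × 681 × 78 with 25 × 25 × 20 kernels).

```python
import numpy as np

def stack_multitemporal(images):
    """Stack a list of co-registered 2D grayscale images into a
    single 3D array of shape (rows, cols, time)."""
    return np.stack(images, axis=-1)

def extract_patch(volume, row, col, size=25, depth=20):
    """Cut a (size x size x depth) kernel centered on a target pixel,
    matching the 25 x 25 x 20 kernels described in the paper."""
    half = size // 2
    return volume[row - half:row + half + 1,
                  col - half:col + half + 1,
                  :depth]

# Demo with reduced dimensions (the paper uses 946 x 681 x 78).
rng = np.random.default_rng(0)
imgs = [rng.random((100, 80), dtype=np.float32) for _ in range(30)]
vol = stack_multitemporal(imgs)      # shape (100, 80, 30)
patch = extract_patch(vol, 50, 40)   # shape (25, 25, 20)
```

Each flood or non-flood target pixel then yields one such 3D patch carrying its multi-temporal backscatter history.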

Image Segmentation and Classification
In a digital image processing application, the primary process is segmentation, to detect and identify objects and components within the image. The segmentation process divides the image into parts known as constituent objects. Automatic segmentation is generally the most challenging task in image processing [12]. With the development of image processing algorithms, image segmentation has also been developed using region growing and merging, i.e., by expanding pixels so that objects become larger; eventually, objects with close values merge into one bigger object. This mathematical algorithm is the basis for developing image segmentation algorithms that carry out the segmentation process unsupervised, without human intervention. Kwak et al. created a SAR satellite data processing algorithm to detect urban floods in near-real-time using data before and after a flood event. Furthermore, the image is classified using a supervised classification to obtain the flood area based on building classes. The developed Probability Density Function (PDF) method can reduce the maximum backscatter intensity difference for rice fields and open areas by 35 dB; however, for urban areas, it increased by 25 dB [9]. Further development of this method can reduce the variance by 12 dB and increase it for urban areas by 15 dB [10]. In comparison, Liang et al. [46] used a PDF to estimate the maximum similarity before thresholding, comparing the Otsu, Split-KI (Kittler and Illingworth), and Local Thresholding (LT) methods. The Overall Accuracy (OA) results obtained from Sentinel-1a image classification on the Louisiana plain were 98.12% (Otsu), 98.55% (Split-KI), and 98.91% (LT), respectively.
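For reference, the Otsu thresholding compared by Liang et al. can be sketched as below. This is a generic implementation of the standard formulation, not the authors' code, and the synthetic "water"/"land" intensity distributions are purely illustrative.

```python
import numpy as np

def otsu_threshold(image, bins=256):
    """Otsu's method: pick the threshold that maximizes the
    between-class variance of the two resulting pixel classes."""
    hist, edges = np.histogram(image, bins=bins)
    p = hist.astype(np.float64) / hist.sum()
    centers = (edges[:-1] + edges[1:]) / 2
    w0 = np.cumsum(p)              # cumulative class-0 probability
    w1 = 1.0 - w0                  # class-1 probability
    mu0 = np.cumsum(p * centers)   # cumulative weighted class-0 mean
    mu_total = mu0[-1]
    valid = (w0 > 0) & (w1 > 0)    # guard against empty classes
    sigma_b = np.zeros_like(w0)
    sigma_b[valid] = (mu_total * w0[valid] - mu0[valid]) ** 2 \
        / (w0[valid] * w1[valid])
    return centers[np.argmax(sigma_b)]

# Bimodal demo: dark "water" backscatter vs. brighter "land".
rng = np.random.default_rng(1)
water = rng.normal(0.1, 0.03, 5000)
land = rng.normal(0.6, 0.05, 5000)
img = np.concatenate([water, land])
t = otsu_threshold(img)
flood_mask = img < t   # low backscatter classified as water
```

On well-separated bimodal histograms the threshold lands in the gap between the modes, which is why thresholding works best in open rural areas and degrades in urban scenes.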

Pelich et al. proposed the creation of a large-scale global database of flood inundation maps derived from SAR datasets [28]. The method used is histogram thresholding for quick delineation; the extent of the flood distribution is then extracted from the SAR backscatter using the Probability Density Function (PDF). Thresholding is performed using the Hierarchical Split-Based Approach (HSBA) to identify pixels with a bimodal distribution among the sub-pixels, which indicates an inundation boundary within those pixels [47]. The accuracy of the results obtained from flood detection in rural areas is 35%.
Another technique in flood detection utilizes the polarization characteristics of radar signals, namely the Interferometric SAR (InSAR) method. The principle of the stable or persistent scatterer is used to detect areas that do not experience changes in reflection characteristics, while changes in reflection characteristics result in low coherence between image data and are assumed to indicate flooding. The mapping is built by creating 20 interferometric pairs from 22 consecutive Sentinel-1a images, composed of 17 pairs of pre-event images, one pair of images during the flood, and two pairs of post-event images [48]. Chini et al. also integrated intensity data with InSAR coherence, using normalized cross-correlation to detect the presence of water in urban areas and mapping double-bounce-producing objects using histogram thresholding and region growing. Pixels are categorized as flooded when there is a decrease in coherence in the RGB composite channel [16].
In line with developments in the field of artificial intelligence, image processing methods have also evolved by making use of artificial intelligence functions. Among the artificial intelligence methods widely used in image processing are Artificial Neural Networks (ANNs). A method that has recently begun to be applied in studies mapping flood potential and vulnerability is machine learning. Some of the methods implemented include Adaptive Neuro-Fuzzy [34], Support Vector Machine (SVM) [35,36], Convolutional Neural Network (CNN) [36,38], and Swarm Intelligence [39,40,49]. Dasgupta et al. used a Gamma Maximum A-Posteriori (GMAP) filter to remove speckle noise from SAR images, then performed surface texture analysis using the Gray Level Co-Occurrence Matrix (GLCM) [34].
Although they are the most common basic methods for flood mapping, NDFI/NDWI, as the most straightforward methods, tend to greatly amplify noise. Otsu thresholding suffers from high computational requirements since it is an early optimization method. The SMUF, SVM, GRNN [34,35], and most recently CNN [50] approaches still perform the classification process in a 2D plane and then perform the 1D multi-temporal process separately. Due to the complexity of the factors that influence the occurrence of floods in urban areas, an effective and efficient classification method is needed. As a classification technology developed on the basis of feature matching, the ML method produces more accurate recognition than plain feature matching; however, it has limited extraction features that can cause errors in the computation process. This study proposes a three-dimensional CNN for mapping flood areas, filtering and weighting neurons to map potential flood areas in urban areas with fewer images compared to previous studies [36,51]. Compared to the Artificial Neural Network (ANN), CNN features unsupervised feature extraction, achieved through the training phase, to recognize flood areas. In an ANN, all neurons of a layer are fully connected to every neuron of the adjacent layers, whereas in a CNN, only the last layer of neurons is fully connected; due to the parameter-sharing nature of the CNN, its computational load is lower than that of an ANN.

Deep Learning Neural Network
Recent developments in Deep Learning Neural Networks (DNN) are increasingly opening up great opportunities in flood mapping research. Deep Learning, as one of the Machine Learning models, has shown promising results in image processing and pattern recognition. Therefore, this research proposes mapping the potential flood areas using a DNN algorithm. A DNN is based on the Artificial Neural Network and generally consists of an input layer, more than one hidden layer, and one output layer [52]. Figure 6 shows the conceptual structure of the DNN model used for flood vulnerability mapping. The input layer holds the factors that affect flooding (F1-Fn). The information is processed and analyzed in the hidden layers to determine the weight and classification of each pixel. The final result of the classification is an indication of flooding in the output layer with two possible labels: Flood (positive class) and Others (negative class). A DNN is a feed-forward network and is trained using the back-propagation method. However, more hidden layers make the network challenging to train because of the different adjustment speeds across the hidden layers. DNN has been implemented successfully in various applications, especially automatic image recognition, speech recognition, language processing, and some applications in remote sensing.
There is no rule of thumb about the number of hidden layers and neurons in each layer since it depends on the complexity of the problem and the conditions of the dataset.
The number of hidden layers gives a DNN the advantage of representing very complex relationships between factors. The hidden layers of a DNN contain neurons that are activated with the Rectified Linear Unit (ReLU) function, an alternative whose computation is more straightforward compared to the sigmoid. Because a DNN is trained on the principle of back-propagation, ReLU minimizes the shrinking of the learning gradient that would otherwise hinder the learning process. Mathematically, the ReLU activation function can be expressed as h(x) in Equation (1):

h(x) = max(0, x)    (1)
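A minimal NumPy sketch of the ReLU function of Equation (1) and of the gradient behavior described above (function names are illustrative, not from the paper):

```python
import numpy as np

def relu(x):
    """Equation (1): h(x) = max(0, x)."""
    return np.maximum(0.0, x)

def relu_grad(x):
    """Derivative used during back-propagation: 1 for x > 0, else 0,
    so positive activations pass gradients through undiminished."""
    return (x > 0).astype(np.float64)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
h = relu(x)        # negative inputs are zeroed
g = relu_grad(x)   # gradient is 0 or 1, never a small fraction
```

Because the gradient is exactly 1 on the active side, stacked layers do not multiply the error signal by values below 1, which is the vanishing-gradient advantage over the sigmoid.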
Hidden layers in a DNN perform increasingly complex feature transformations to produce a more discriminative feature abstraction. The classification results displayed in the output layer are based on the most abstract features obtained in the last hidden layer. During the DNN learning phase, the connection weights between layers are adjusted to reduce the difference between the observed and predicted results. The back-propagation process trains the DNN by feeding the error back to the hidden layers. The deviation between the observed and predicted results is expressed as a cross-entropy loss function, as given in Equation (2):

L = -(1/N_D) Σ_{n=1}^{N_D} Σ_{c=1}^{C} T_{n,c} log(Y_{n,c})    (2)
where N_D is the number of training data points, T represents the observed output, and Y represents the predicted output. The back-propagation learning gradient used for training sample m is formulated in Equation (3), where L is the loss function, w represents the network weights, and C = 2 is the number of output classes used (flood and others).
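Assuming Equation (2) is the standard cross-entropy over one-hot targets, as the surrounding definitions suggest, it can be sketched as follows (an illustrative implementation, not the authors' code):

```python
import numpy as np

def cross_entropy(T, Y, eps=1e-12):
    """Equation (2): mean over the N_D samples of -sum_c T_c * log(Y_c),
    with C = 2 classes (flood / others). eps avoids log(0)."""
    Y = np.clip(Y, eps, 1.0)
    return -np.mean(np.sum(T * np.log(Y), axis=1))

# One-hot observed outputs T for 3 samples and predicted outputs Y.
T = np.array([[1, 0], [0, 1], [1, 0]], dtype=float)
Y = np.array([[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]])
loss = cross_entropy(T, Y)   # small when Y agrees with T
```

The loss is zero only when the predicted probability of the observed class is 1 for every sample, which is the quantity back-propagation drives down.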

Convolutional Neural Network (CNN)
A CNN is a type of DNN that uses the convolution principle in its data processing. The basic concept of the CNN architecture is to utilize a convolutional layer to detect the relationships between object features and a pooling layer to group similar features. The CNN architecture consists of a series of layers, namely the Convolutional Layer (CL), which transforms a set of activations with a differentiable function, a Pooling Layer, and finally a Fully Connected Layer (FCL). Unlike other neural networks, where all neurons are fully connected with every neuron of the next layer, a CNN disregards zero-valued parameters and makes fewer connections between layers. The non-zero parameters can be shared by more than one connection in a layer to reduce the number of connections. This characteristic is useful for recognizing features.
The pooling layer is used to reduce the size of an image by downsampling it and summarizing its features. The common pooling methods are average pooling, which summarizes a region by its mean value, and maximum pooling, which keeps the strongest feature [53]. Average pooling produces a smooth feature that is useful for extracting the most representative value, such as the color of a surface, where a small variation at isolated points within a region does not affect the overall value. On the other hand, max-pooling extracts high-contrast data, such as edges or points.
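The contrast between the two pooling modes can be illustrated with a small sketch (non-overlapping windows assumed; the isolated bright pixel stands in for a speckle-like point feature):

```python
import numpy as np

def pool2d(x, size=2, mode="max"):
    """Non-overlapping pooling: split x into size x size windows and
    summarize each window with its maximum or its mean."""
    r, c = x.shape
    windows = x[:r - r % size, :c - c % size].reshape(
        r // size, size, c // size, size)
    if mode == "max":
        return windows.max(axis=(1, 3))
    return windows.mean(axis=(1, 3))

# A flat region with one isolated bright pixel:
x = np.zeros((4, 4))
x[1, 1] = 1.0
mx = pool2d(x, mode="max")   # the bright point dominates its window
av = pool2d(x, mode="avg")   # the bright point is smoothed out
```

Max pooling preserves the isolated high-contrast point at full strength, while average pooling dilutes it across the window, which matches the edge-versus-surface distinction drawn above.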
A problem with sampling a matrix (and an image) in the CL is that pixels at and near the edge are sampled less often than pixels farther from the edge, which sometimes results in sampling inaccuracy. To prevent this, the input to the kernel filter is padded with extra rows and columns so that more information can be collected from the edge pixels. For two-dimensional data, there are two types of padding: same padding and valid padding. Same padding keeps the output the same size as the original matrix; in effect, it resamples the image. Valid padding considers only fully covered pixels valid, so every value the model uses comes from actual pixels. This is useful for keeping the information from corner pixels, which would otherwise be sampled less often than other pixels.
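Under the standard shape arithmetic for the two padding modes, the output size of a convolution can be sketched as below (a generic reference sketch, with stride 1 by default; the function name is illustrative):

```python
def conv_output_size(n, k, padding, stride=1):
    """Spatial output size along an n-pixel axis convolved with a
    k-pixel filter. 'same' pads so the output matches the input size
    (for stride 1); 'valid' uses only positions where the filter
    fits entirely inside the input."""
    if padding == "same":
        return -(-n // stride)          # ceil(n / stride)
    if padding == "valid":
        return (n - k) // stride + 1
    raise ValueError(f"unknown padding mode: {padding}")

# A 946-pixel axis (the dataset width) with a 25-pixel filter:
same_out = conv_output_size(946, 25, "same")    # size preserved
valid_out = conv_output_size(946, 25, "valid")  # shrinks by k - 1
```

With valid padding the output loses k - 1 pixels per axis, which is why padding choices matter for how edge and corner pixels are represented.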
The extracted features compose the feature map that the FCL uses to classify the result. This approach makes CNN a method with fewer computational requirements than a fully connected ANN structure. The CL calculation is formulated in Equation (4):

h_{i,j} = Σ_{k=1}^{m} Σ_{l=1}^{m} w_{k,l} x_{i+k-1, j+l-1}    (4)

the pooling layer (max pooling) is stated in Equation (5):

h_{i,j} = max_{1≤k,l≤m} x_{i+k-1, j+l-1}    (5)

and the fully connected layer h is formulated in Equation (6):

h = Σ_{i} w_i x_i    (6)

where h_{i,j} is the output at point (i, j) of the layer with input x and filter w, and m denotes the width and height of the filter. Non-linear activation functions are used in the CL and FCL, including the Sigmoid, the Hyperbolic Tangent (Tanh), and the Rectified Linear Unit (ReLU), which converts negative values to zero. A three-dimensional CNN is a CNN structure whose input is a set of square matrices, s × (n × n), making it a suitable method for image segmentation and classification. In this study, the dataset used is multi-sensor, multi-temporal data derived from SAR and optical images, rainfall data, and ground surface contour data, as shown in Figure 7.
Figure 7. Representation of the multi-temporal 2D dataset as 3D data.
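Equation (4) can be sketched as a direct, unoptimized implementation (the edge-detecting filter in the demo is illustrative, not from the paper):

```python
import numpy as np

def conv2d_valid(x, w):
    """Equation (4): h[i, j] = sum over k, l <= m of
    w[k, l] * x[i + k - 1, j + l - 1], i.e., a valid (no padding)
    correlation of input x with an m x m filter w."""
    m = w.shape[0]
    rows = x.shape[0] - m + 1
    cols = x.shape[1] - m + 1
    h = np.empty((rows, cols))
    for i in range(rows):
        for j in range(cols):
            h[i, j] = np.sum(w * x[i:i + m, j:j + m])
    return h

# A vertical-edge filter responds strongly where intensity changes.
x = np.zeros((5, 5))
x[:, 3:] = 1.0                      # bright right half
w = np.array([[-1.0, 1.0],
              [-1.0, 1.0]])         # 2 x 2 edge detector
h = conv2d_valid(x, w)              # peaks along the edge column
```

The filter response is zero over flat regions and large at the brightness transition, which is the behavior the CL exploits to build feature maps.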

Proposed Method
Segmentation and classification of flooded areas using 3D CNN on the SAR image dataset and the flood factors consists of a three-dimensional dataset segmentation stage using the three-dimensional CNN to obtain initial segmentation results. These results are used to weight the neuron connections for n-dimensional optimization, yielding the classification of pixels into flood or other categories.
In the three-dimensional CNN shown in Figure 8, several CLs with dimensions a × a × a are used to filter the input data to obtain a feature map. The input data used are shown in Table 1. The images are downsampled using a pooling layer that summarizes the features present in the images; in this model, the pooling layer uses max pooling, which keeps the most dominant value in each sample. To prevent edge and corner pixels from being omitted by the model, valid padding is used on the input layer and the CL. The padding basically leaves the image unchanged but allows edge and corner pixels to be sampled more, as they are now placed farther from the edge. Furthermore, a pooling layer measuring b × b × b is used to reduce the map, so that neuron connections are formed to compile the information obtained, which is then formed into the FCL. The FCL stores the different feature values and compiles them into a feature map with two output categories, namely flood pixels and non-flood pixels.
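One CL-plus-pooling pass over a single 25 × 25 × 20 input kernel can be sketched as follows for shape illustration. This is a loop-based toy, not the trained model; the filter size a = 3 and pooling size b = 2 are assumed for the demo and are not taken from the paper.

```python
import numpy as np

def conv3d_valid(x, w):
    """One 3D CL step: slide an a x a x a filter over the volume and
    sum the element-wise products (valid, i.e., no padding)."""
    a = w.shape[0]
    out = np.empty(tuple(s - a + 1 for s in x.shape))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            for k in range(out.shape[2]):
                out[i, j, k] = np.sum(w * x[i:i + a, j:j + a, k:k + a])
    return out

def maxpool3d(x, b=2):
    """b x b x b max pooling, keeping the dominant value per block."""
    d = tuple(s - s % b for s in x.shape)
    v = x[:d[0], :d[1], :d[2]].reshape(
        d[0] // b, b, d[1] // b, b, d[2] // b, b)
    return v.max(axis=(1, 3, 5))

rng = np.random.default_rng(2)
patch = rng.random((25, 25, 20))          # one input kernel
filt = rng.random((3, 3, 3))              # assumed a = 3 filter
feat = conv3d_valid(patch, filt)          # feature map (23, 23, 18)
pooled = maxpool3d(feat)                  # reduced map (11, 11, 9)
```

The single convolution mixes the two spatial axes and the temporal axis at once, which is the distinction from the 2D-then-1D pipelines discussed earlier.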
The stages carried out in this study began with an inventory of the data used for classification, namely the SAR image dataset. The pre-processing stage comprises registering the image data to ensure that coordinates are consistent between different images. As the images are in RGB TIF format with r × c × 3 dimensions, they are first converted into grayscale images, and then sub-image samples representing flood and non-flood targets are selected. The data are then divided into training and validation sets. The Feature Learning stage, or training, provides training data for the model to store known flood data. The commonly used proportion between the two sets is 70:30 [54], but we also include 80:20 and 90:10 for comparison. The training data are used to train the 3D CNN [36] to determine optimized parameter values. The next stage is to train the three-dimensional CNN classifier to detect water surfaces and differentiate them from other surfaces by the variance of the pixel values, since dry land and permanent water bodies have consistent values. The ReLU plays a significant part in this phase: because flood areas tend to change values, dry land that changes to a water surface and then back to dry land can produce a negative value. The ReLU rectifies this problem and prevents a neuron with a negative output from contributing to the network. The Classification stage presents the system with other data to recognize whether flood features are present in the images, using feedback from the results of the Training stage. The overall process in the research is shown in Figure 9.
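A minimal sketch of the pre-processing steps just described (grayscale conversion of an r × c × 3 image, sub-image sampling, and a 70:30 training/validation split), using numpy with synthetic data standing in for the registered SAR TIFs; the patch size, patch count, and luminance weights are illustrative assumptions, not the authors' settings.

```python
import numpy as np

def to_grayscale(rgb):
    """Collapse an r x c x 3 RGB array to r x c via luminance weights."""
    return rgb @ np.array([0.299, 0.587, 0.114])

def sample_patches(image, size, n, rng):
    """Draw n random size x size sub-images (flood / non-flood targets)."""
    r, c = image.shape
    patches = []
    for _ in range(n):
        y = rng.integers(0, r - size + 1)
        x = rng.integers(0, c - size + 1)
        patches.append(image[y:y + size, x:x + size])
    return np.stack(patches)

rng = np.random.default_rng(42)
rgb = rng.random((64, 64, 3))        # stand-in for a registered RGB TIF
gray = to_grayscale(rgb)
patches = sample_patches(gray, 16, 100, rng)

# 70:30 training/validation split, the paper's baseline proportion.
n_train = int(0.7 * len(patches))
train, val = patches[:n_train], patches[n_train:]
print(train.shape, val.shape)  # (70, 16, 16) (30, 16, 16)
```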
Figure 8. Representation of 3D-CNN process.

Results and Discussion
The three-dimensional CNN model is trained with two main hyperparameters: the number of epochs, where one epoch is a complete feed-forward iteration over the training data before the next iteration starts; and the validation split, the proportion of the training data held out to validate the result of the training. In this research, we combine training/validation splits of 70/30, 80/20, and 90/10 with 100 and 150 epochs. The elapsed time and resulting accuracy for each combination are shown in Table 2 and plotted in Figure 10. Accuracy is defined as the percentage of correct predictions on the test data, calculated by dividing the number of correct predictions by the total number of predictions, while elapsed time counts the total time needed to perform the training with the corresponding proportion and number of epochs.
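The accuracy definition above, together with the RMSE reported later in the results, can be stated in a few lines; the labels here are hypothetical, with 1 marking a flood pixel and 0 a non-flood pixel.

```python
import numpy as np

# Hypothetical per-pixel labels: 1 = flood, 0 = non-flood.
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0, 1, 0])

# Accuracy: correct predictions divided by total predictions.
accuracy = np.mean(y_true == y_pred)

# RMSE: the other error measure reported for the trained model.
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))

print(accuracy, round(rmse, 3))  # 0.8 0.447
```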

Figure 10 shows that the validation accuracy quickly becomes stable after 20 to 25 epochs, while the training accuracy keeps increasing until it reaches 100%. This condition indicates that the model was overfitting during testing. Overfitting results from a vast set of neural connections, which often reduces the system's fitness because non-common cases are included in the learning data [55]. We readjusted the model to reduce overfitting and then tested it with the same hyperparameters. The overfitting correction randomly deactivates some neurons in each layer so that they are not used during forward- and back-propagation training. This causes the learning process to spread out connection weights without focusing on specific neurons. In this research, the deactivation probability is set at 0.5, meaning each neuron has an equal chance of being excluded from the learning process. A low deactivation probability will not reduce overfitting, while a high probability will cause the system to underachieve.
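A sketch of the deactivation step (dropout) with probability 0.5, assuming the common "inverted dropout" scaling in which surviving activations are rescaled by 1/(1 − p) so the expected output is unchanged; the paper does not specify which variant was used.

```python
import numpy as np

def dropout(activations, p_drop, rng):
    """Randomly deactivate neurons with probability p_drop during training.

    Assumes inverted dropout: survivors are scaled by 1/(1 - p_drop)
    so the layer's expected output stays the same at test time.
    """
    mask = rng.random(activations.shape) >= p_drop
    return activations * mask / (1.0 - p_drop)

rng = np.random.default_rng(7)
layer_out = np.ones(10_000)          # toy layer activations
dropped = dropout(layer_out, 0.5, rng)  # p = 0.5, as in the paper

# Roughly half the neurons are zeroed, while the mean stays near 1.
print((dropped == 0).mean(), round(dropped.mean(), 2))
```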
Reducing neurons results in a smaller, simpler, and more regulated connection network, which means outlying or widely different results will be disregarded. In this manner, the overall error could be reduced by averaging errors from different connections.
The adjusted model yields the results shown in Table 3 and Figure 11. Compared with the previous test, computation takes about 50 min longer for the 80/20 and 90/10 splits, with the resulting accuracy now exceeding 0.7. The most significant increase in accuracy is for 150 epochs with a 90/10 split, which improves by 0.045 to an accuracy of 0.719; the lowest RMSE is achieved by the 70/30 split with 100 epochs, dropping from 0.284 to 0.024. The fastest computing time of 165 min is achieved with 100 epochs and a 70/30 split. This result is consistent with the expectation that less training data gives faster computing but lower accuracy, while a higher percentage of training data takes longer but yields higher accuracy.
Further testing with 140, 145, 155, and 160 epochs to find the optimum combination of accuracy and shorter time yields the results shown in Figure 12. Since the testing accuracy is greater with 150 epochs than with 100 epochs, we assume that accuracy will peak within 150 ± 10 epochs. Testing with a 70/30 split confirms that as the number of epochs increases from 140 to 160, accuracy gradually improves by 13.6 percentage points, from 0.567 to 0.703, while computing time increases by 12%, from 240 min to 269 min. A similar trend is observed with an 80/20 split, with accuracy increasing by 13.5 percentage points, from 0.577 to 0.712, but with a much longer computing time, from 243 min to 304 min, an increase of 25.1%. The larger increase is due to the additional training needed for the 80/20 split compared with the 70/30 split. With a 90/10 split, the peak accuracy of 0.719 is achieved at 150 epochs; testing with 140, 145, 155, and 160 epochs gives lower results.
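Note that the quoted accuracy gains are absolute differences (percentage points), while the computing-time gains are relative percentages; a quick check reproduces the reported figures for the 70/30 split.

```python
# Reported epoch-sweep figures (140 -> 160 epochs) for the 70/30 split.
acc_140, acc_160 = 0.567, 0.703   # accuracy
t_140, t_160 = 240, 269           # computing time in minutes

# Accuracy changes are quoted as absolute differences (percentage points)...
acc_gain = round((acc_160 - acc_140) * 100, 1)
# ...while computing-time changes are quoted relative to the starting time.
time_gain = round((t_160 - t_140) / t_140 * 100, 1)

print(acc_gain, time_gain)  # 13.6 12.1
```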
In Table 4, the three-dimensional CNN, without combination with other methods, achieves higher accuracy than the 0.685 reported by Wang et al. [36]. It is comparable to Grimaldi et al. [11], whose open-trees flood accuracy ranges from 0.55 to 0.70 under conditions similar to the flooded areas in Jakarta. Figure 12 shows the flood map of the proposed model compared with the SAR image in Figure 2; the model detects most of the dark flood areas while leaving out the similarly dark Jakarta Bay. Compared with the sub-district-level flood map publicly released by the government [2], the detected flood also falls within the reported sub-districts. Discrepancies between the detected and reported areas remain because the report classifies floods at whole-sub-district coverage.


Conclusions
In this study, an application of a three-dimensional Convolutional Neural Network to flood mapping is proposed. The deactivation factor minimizes the overfitting problem by reducing the number of neurons and simplifying the connections. The results show that the 3D-CNN method enables the analysis of multi-temporal images for flood detection and classification instead of using multiple image pairs with multiple classification levels. Across the three training/test splits, the highest overall accuracy of 0.72 was achieved with a 90/10 split and 150 epochs in 302 min. Regarding computation time, the best performance is achieved with an 80/20 split and 150 epochs, with an accuracy of 0.71 in 283 min. Further tests with epochs other than 150 showed that accuracy gradually decreases with a 90/10 split, but with lower training fractions the accuracy improves as the number of epochs increases.