Tropical Cyclone Intensity Estimation Using Multi-Dimensional Convolutional Neural Networks from Geostationary Satellite Data

: For a long time, researchers have tried to ﬁnd a way to analyze tropical cyclone (TC) intensity in real-time. Since there is no standardized method for estimating TC intensity and the most widely used method is a manual algorithm using satellite-based cloud images, there is a bias that varies depending on the TC center and shape. In this study, we adopted convolutional neural networks (CNNs) which are part of a state-of-art approach that analyzes image patterns to estimate TC intensity by mimicking human cloud pattern recognition. Both two dimensional-CNN (2D-CNN) and three-dimensional-CNN (3D-CNN) were used to analyze the relationship between multi-spectral geostationary satellite images and TC intensity. Our best-optimized model produced a root mean squared error (RMSE) of 8.32 kts, resulting in better performance (~35%) than the existing model using the CNN-based approach with a single channel image. Moreover, we analyzed the characteristics of multi-spectral satellite-based TC images according to intensity using a heat map, which is one of the visualization means of CNNs. It shows that the stronger the intensity of the TC, the greater the inﬂuence of the TC center in the lower atmosphere. This is consistent with the results from the existing TC initialization method with numerical simulations based on dynamical TC models. Our study suggests the possibility that a deep learning approach can be used to interpret the behavior characteristics of TCs.


Introduction
On-going climate change makes natural disasters unpredictable. The Intergovernmental Panel on Climate Change (IPCC) special report said that the global warming over 2 • C leads to an increase of heavy rainfall frequency and a decrease in the occurrence of tropical cyclones (TCs), but an increase of the number of strong TCs. It makes people hard to prepare natural disasters such as TCs, which cause huge damage to human beings and infrastructure [1,2]. According to World bank annual report 2012, behavioral changes in TCs could cause a direct increase in the economic damage they create, from $28 billion in 2010 to $68 billion by 2100 [3,4]. East Asia is one of the regions vulnerable to natural disasters, where approximately 30% of the global economic damage caused by TCs occurs [5]. In order to better prepare for and respond to TC disasters, the quick identification of TC intensity is crucial, as well as accurate forecasting.
However, it is difficult to monitor TCs using ground-based observations because TCs generally occur and develop in the middle of the ocean. The development of meteorological satellite sensor systems has opened a new era in climate forecasting and meteorological observations. High temporal resolution geostationary satellite data are considered to be one of the most reliable means of observing TCs in real-time, producing information on the various characteristics of TCs, such as intensity and center location [6][7][8]. A widely used method to determine TC intensity is the Dvorak technique [9]. It is a manual algorithm, which determines the intensity of TCs based on empirical satellite image analysis. The Dvorak technique has been used for extracting the scale of TCs and establishing relevant damage recovery policies. However, due to its subjectivity, the reliability of the real-time intensity readings based on the algorithm is inevitably low [10][11][12][13].
To overcome this limitation, many researchers have proposed objective TC intensity estimation algorithms. Velden et al. [14] proposed the objective Dvorak technique (ODT), which is a computer-based algorithm that uses satellite images. The coldest ring of a cyclone, which has the lowest brightness temperature within a certain radius around the eye of the cyclone, is used for estimating the TC's intensity. It is verified using the minimum sea level pressure (MSLP) to represent the intensity of the TC, resulting in a root mean square error (RMSE) of 11.45 hPa. Olander et al. [10] proposed an advanced Dvorak technique (ADT), which is an improved version of ODT that uses specific physical conditions for not only TCs with eyes but also non-eyed weak TCs. ADT performed about 20% better than the ODT method. Pineros et al. [15] introduced a systematic interpretation method using satellite infrared images of TCs and Ritchie et al. [16] proposed the deviation-angle variance technique (DAV-T) based on the structural analysis of TCs. DAV-T quantifies the trend of pixel-based brightness temperatures toward the center of a TC and determines its intensity using a degree of concentration in the center location. Their estimation model was tested based on mean sustained wind speed (MSW) as the reference intensity of TCs with validation using the best track data issued by the Joint Typhoon Warning Center (JTWC). The DAV-T showed an RMSE of 12.7 kts for TCs that occurred in the northwestern Pacific from 2005 to 2011.
More recently, convolutional neural networks (CNNs), which are one of the deep learning techniques in artificial intelligence, have been used to analyze satellite-based TC images to estimate their intensity. Pradhan et al. [17] estimated the intensity of TCs using single IR images based on CNNs, resulting in an RMSE of 10.18 kts. Combinido et al. [18] adopted the Visual Geometry Group 19-layer (VGG19) model for estimating TC intensity, which is a well-performing 2D-CNN architecture for image analysis proposed by Simonyan et al. [19]. They used single IR TC images from multiple geostationary satellite sensors from 1996 to 2016 over the Western North Pacific to develop the model, resulting in an RMSE of 13.23 kts. Wimmers et al. [20] used satellite-based passive microwave sensors to estimate TC intensity using the 2D-CNN approach. They used 37, 85-92 GHz channels to extract TC images as input data. The model shows a validation RMSE of 14.3 kts.
In this study, we benchmarked the existing TC intensity estimation methods based on CNNs and proposed improved CNN models using geostationary satellite-based multi-spectral images, which adopt a multi-dimensional approach, considering the vertical structure of TCs. Whereas existing methods using geostationary satellites consider single infrared channels for extracting cloud top pattern of TCs, we attempted to incorporate the three-dimensional asymmetrical structure of TCs caused by vertical wind shear which affects the intensity of TCs [21]. The horizontal and vertical TC patterns were analyzed by multi-dimensional CNNs which have shown marvelous performance in image pattern recognition and remote sensing [22][23][24][25][26]. Multi-channels input data were used in the proposed CNN models for estimating TC intensity. The proposed approach uses multi-sensor based satellite images to show the correlation between the shape and intensity of TCs with consideration for the water vapor diameters constituting TCs. The objectives of this research were to 1) propose an objective TC intensity estimation algorithm based on CNN approaches using geostationary satellite data, 2) identify the best CNN model with optimized parameters adapted to estimate TC intensity using satellite-based multi-spectral images, and 3) examine the significant impact of the vertical relative distribution of water vapor on TC intensity estimation using heat maps, which is one of the visualization methods for interpreting CNN-based deep learning models.

Geostationary Meteorological Satellite Sensor Data
We used Communication, Ocean, and Meteorological Satellite (COMS) Meteorological Imager (MI) sensor data to estimate the intensity of TCs. COMS is the first Korean geostationary meteorological satellite and was launched in 2010. It is stationed at the longitude of 128.2 • E and 36,000 km above the earth equator [27]. The COMS MI sensor observes every 15 min over one side of the Earth with a horizontal spatial resolution of 1 km to 4 km. The sensor consists of five spectral channels-one visible channel and four infrared channels (Table 1). The infrared channels are widely used for deriving cloud information such as water vapor content of atmospheric layers. The MI sensor has multi-spectral channels from a short-wave infrared channel of 3.7 µm to a long wavelength channel of 12.0 µm. The long-wavelength channels with infrared 1 (IR1, 10.8 µm) and infrared 2 (IR2, 12.0 µm) are sensitive to the water vapor contents in the upper atmosphere [6,14,28]. The water vapor channel (WV, 6.7 µm) provides middle atmospheric components [29]. The shortwave infrared (SWIR, 3.7 µm) is widely used for detecting low clouds. Because it has lower brightness temperature as the diameter of the droplet in the atmosphere increases, the value variation is larger than that of the longwave channel [30,31]. Rosenfeld et al. [32] proposed the vertical profile of the cloud drop effective radius in a severe convective storm through satellite observation-based simulation. In severe storms, the higher the altitude of the atmosphere, the larger effective radius (i.e., effective radius of 1-7 µm at the atmospheric altitude < 2 km to 10-30 µm at the atmospheric height > 6 km). In addition, the water droplet-content-rate goes higher with a higher atmospheric altitude in severe storms. Since the WV and SWIR channel signals were related to the lower atmosphere, they could be considered as the cloud droplet distribution with a smaller effective radius in a lower altitude of the atmosphere [32,33].

Best Track Data
JTWC produces tropical cyclone data known as "best track" data which is from the International Best Track Archives for Climate Stewardship (IBTrACS). It includes the location of the tropical cyclone center (degrees), maximum sustained wind speed (kts), minimum sea level pressure (hPa) and tropical cyclone radius for the Southern Hemisphere (SH), the Northern Indian Ocean (NIO) and the Western North Pacific (WNP) regions. The annually-organized best track data with a 6-hour interval are officially provided 6 months to 1 year after the previous year due to their post-processing using corrected observational data and numerical model results [34][35][36]. Our research was conducted using the TCs generated in the WNP region within the observation range of the COMS MI sensor. There are the most frequent occurrences and the largest lifetime maximum intensity (LMI) variations of TCs over this region.

Input Data Preparation
TC images used to develop the estimation models were extracted from four infrared channels of COMS MI. Since tropical cyclone eyewalls, the shape of spiral rain-bands formed by the cirrus outflow, and vertical wind shear are all crucial structural factors for estimating intensity [7], it is necessary to use an image that covers the whole shape of a TC as an input for training the patterns of TCs. We delineated one 301 × 301-pixel image (i.e., 1204 km × 1204 km) per TC based on the grid of each TC center location from JTWC best track data. This is illustrated using COMS in Figure 1. The delineated input images were scaled up to 101 × 101 pixel images based on bilinear interpolation using the image resize tool available in MATLAB 2018a for computational efficiency so that the input images had a horizontal spatial resolution of about 12 km.

Input Data Preparation
TC images used to develop the estimation models were extracted from four infrared channels of COMS MI. Since tropical cyclone eyewalls, the shape of spiral rain-bands formed by the cirrus outflow, and vertical wind shear are all crucial structural factors for estimating intensity [7], it is necessary to use an image that covers the whole shape of a TC as an input for training the patterns of TCs. We delineated one 301×301-pixel image (i.e., 1204 km x 1204 km) per TC based on the grid of each TC center location from JTWC best track data. This is illustrated using COMS in Figure 1. The delineated input images were scaled up to 101×101 pixel images based on bilinear interpolation using the image resize tool available in MATLAB 2018a for computational efficiency so that the input images had a horizontal spatial resolution of about 12 km. The images are upscaled to 101 × 101 pixels to reduce computational demand. Since tropical cyclones are larger scaled phenomena than cloud clusters, tropical cyclone patterns, such as spiral rain bands or eyewall strength pattern, clearly appear in the upscaled images.
To train CNN-based intensity estimation models, we used COMS MI images of TCs which developed over the WNP from 2011 to 2016 based on the best track data. Since the track data consists of accumulating the intensity of the TC lifetime, it causes an imbalance problem in training data in terms of intensity. Approximately 70% of the TCs have an intensity fewer than 63 kts, whereas only 12% show intensity of 96 kts or more, which is likely to cause devastating damage. The data imbalance problem leads to overfitting major samples and poorly estimates minority samples [37]. To overcome that, we balanced our dataset through subsampling and oversampling processes.
Prior to these processes, the best track data between 2011 and 2016 were randomly divided into 6:2:2 for training, test, and validation data, respectively, and only training and test data were balanced according to intensity for unbiased training and parameterization. To balance our dataset, we removed the samples with a high frequency of intensity. Using the 10 kts interval-based histogram, we randomly removed the samples on some bins which had more than 25% frequency. Then, the data of the other bins were augmented through two oversampling processes: hourly interpolation and image rotation. The 6-hour-interval TC tracks were interpolated into hourly data during the subsampling process, except for the randomly removed ones ( Figure 2). The distribution of temporally interpolated hourly data was still imbalanced due to a minority of high-intensity samples. To balance the dataset, the extracted images were augmented by rotating them to various angles. The smaller the number of binned data, the more images were augmented with smaller angles. In addition, four major TCs developed in 2017 (i.e., 2017 Typhoon HATO, KAHNUN, LAN, and To train CNN-based intensity estimation models, we used COMS MI images of TCs which developed over the WNP from 2011 to 2016 based on the best track data. Since the track data consists of accumulating the intensity of the TC lifetime, it causes an imbalance problem in training data in terms of intensity. Approximately 70% of the TCs have an intensity fewer than 63 kts, whereas only 12% show intensity of 96 kts or more, which is likely to cause devastating damage. The data imbalance problem leads to overfitting major samples and poorly estimates minority samples [37]. To overcome that, we balanced our dataset through subsampling and oversampling processes. Prior to these processes, the best track data between 2011 and 2016 were randomly divided into 6:2:2 for training, test, and validation data, respectively, and only training and test data were balanced according to intensity for unbiased training and parameterization. To balance our dataset, we removed the samples with a high frequency of intensity. Using the 10 kts interval-based histogram, we randomly removed the samples on some bins which had more than 25% frequency. Then, the data of the other bins were augmented through two oversampling processes: hourly interpolation and image rotation. The 6-hour-interval TC tracks were interpolated into hourly data during the subsampling process, except for the randomly removed ones ( Figure 2). The distribution of temporally interpolated hourly data was still imbalanced due to a minority of high-intensity samples. To balance the dataset, the extracted images were augmented by rotating them to various angles. The smaller the number of binned data, the more images were augmented with smaller angles. In addition, four major TCs developed in 2017 (i.e., 2017 Typhoon HATO, KAHNUN, LAN, and DAMREY) were used for additional hindcast validation of our proposed models. Finally, a total of 49,420 balanced samples were prepared. Table 2 shows the difference in the number of samples before and after data adjustment.
Remote Sens. 2019, 11, x FOR PEER REVIEW 5 of 26 DAMREY) were used for additional hindcast validation of our proposed models. Finally, a total of 49,420 balanced samples were prepared. Table 2 shows the difference in the number of samples before and after data adjustment.  Based on the preprocessed data, four infrared images of TCs were extracted and stacked to construct the input data. Each channel of input data was normalized from 0 to 1 in order to focus on the pattern of clouds [38]. Figure 3 shows an example of the difference in the convective pattern distribution by wavelength. Each wavelength depicts a pattern that is affected by the wavelengthcorresponding particle size. That is, the stacking of the multi-infrared images shows the relative difference in droplet particle size. Therefore, we used the stacked dataset as an indicator of relative  Based on the preprocessed data, four infrared images of TCs were extracted and stacked to construct the input data. Each channel of input data was normalized from 0 to 1 in order to focus on the pattern of clouds [38]. Figure 3 shows an example of the difference in the convective pattern distribution Remote Sens. 2020, 12, 108 6 of 24 by wavelength. Each wavelength depicts a pattern that is affected by the wavelength-corresponding particle size. That is, the stacking of the multi-infrared images shows the relative difference in droplet particle size. Therefore, we used the stacked dataset as an indicator of relative TC formation.

Convolutional Neural Networks (CNNs)
CNNs are one of the hierarchical neural network systems which use feature vector extraction to make multi-dimensional complex data analyzable through dimensional reduction [39][40][41]. Thus, it has been widely adopted for recognizing visualization data such as handwriting, photos, and medical images [27,[42][43][44][45][46][47]. Recently, it has been used in climate-related research to recognize the pattern of numerical model results, reanalysis data, and satellite-based observations [29,48,[49][50][51]. CNNs consist of three major parts: the convolutional layers, pooling layers, and the fully connected layer. The information from the input data pattern is extracted in the convolutional layers and the data dimension decreases through the pooling layers. After that, the extracted feature determines the output based on the extracted values in the fully connected layers. In the convolutional layers, weights and biases are compared in each layer and any that are shared are optimized based on a back-propagation process. Subsequently, the last convolutional layer is flattened so that it can be applied to the fully connected layer to train the features extracted from the convolutional layers [22,42,52].
In this study, we used two types of CNNs for estimating TC intensity. One is two-dimensional (2D)-CNNs and the other is three-dimensional (3D)-CNNs. Figure 4 summarizes how the calculations in the convolutional layers are conducted showing the difference between the two methods (i.e., 2Dand 3D-CNNs). The major difference between the two is the dimension of the kernel applied to each convolutional layer. In 2D-CNNs, an activation map, which is the output of each convolutional layer, is calculated as the sum of the layers applied with the kernel and bias of each layer. 3D-CNNs, however, use one-dimensional level higher than 2D-CNNs in terms of hyper-parameters, such as kernels and pooling layers, which are effective in multi-dimensional data [19,44]. The threedimensional kernel and bias are applied to the segmented inputs. Since kernels are convoluted threedimensionally in 3D-CNNs, the output can have a multi-channel, which enables the information from each input channel to be preserved. If the user defines the kernel depth of a 3D-CNN model as 1, the number of activation map bundles corresponding to the number of the input data depth is generated. On the other hand, 2D-CNNs generate as many activation maps as kernels in each convolutional layer. As one convolutional layer passes, a 2D-CNN model has a set of as many activation maps as the number of kernels. However, a 3D-CNN can generate multiple times as many activation maps as the number of kernels, which results in a significant increase in computational demand. Whereas a

Convolutional Neural Networks (CNNs)
CNNs are one of the hierarchical neural network systems which use feature vector extraction to make multi-dimensional complex data analyzable through dimensional reduction [39][40][41]. Thus, it has been widely adopted for recognizing visualization data such as handwriting, photos, and medical images [27,[42][43][44][45][46][47]. Recently, it has been used in climate-related research to recognize the pattern of numerical model results, reanalysis data, and satellite-based observations [29,[48][49][50][51]. CNNs consist of three major parts: the convolutional layers, pooling layers, and the fully connected layer. The information from the input data pattern is extracted in the convolutional layers and the data dimension decreases through the pooling layers. After that, the extracted feature determines the output based on the extracted values in the fully connected layers. In the convolutional layers, weights and biases are compared in each layer and any that are shared are optimized based on a back-propagation process. Subsequently, the last convolutional layer is flattened so that it can be applied to the fully connected layer to train the features extracted from the convolutional layers [22,42,52].
In this study, we used two types of CNNs for estimating TC intensity. One is two-dimensional (2D)-CNNs and the other is three-dimensional (3D)-CNNs. Figure 4 summarizes how the calculations in the convolutional layers are conducted showing the difference between the two methods (i.e., 2D-and 3D-CNNs). The major difference between the two is the dimension of the kernel applied to each convolutional layer. In 2D-CNNs, an activation map, which is the output of each convolutional layer, is calculated as the sum of the layers applied with the kernel and bias of each layer. 3D-CNNs, however, use one-dimensional level higher than 2D-CNNs in terms of hyper-parameters, such as kernels and pooling layers, which are effective in multi-dimensional data [19,44]. The three-dimensional kernel and bias are applied to the segmented inputs. Since kernels are convoluted three-dimensionally in 3D-CNNs, the output can have a multi-channel, which enables the information from each input channel to be preserved. If the user defines the kernel depth of a 3D-CNN model as 1, the number of activation map bundles corresponding to the number of the input data depth is generated. On the other hand, 2D-CNNs generate as many activation maps as kernels in each convolutional layer. As one convolutional layer passes, a 2D-CNN model has a set of as many activation maps as the number of kernels. However, a 3D-CNN can generate multiple times as many activation maps as the number of kernels, which results in a significant increase in computational demand. Whereas a low-dimensional CNN is more adept at simplifying and effectively training input data through the aggregation of the information as a convolutional layer passes by, a high-dimensional CNN could train the model while preserving more information from each channel. However, a high-dimensional CNN has a disadvantage in terms of the complexity of the model including parameter optimization.
Remote Sens. 2019, 11, x FOR PEER REVIEW 7 of 26 low-dimensional CNN is more adept at simplifying and effectively training input data through the aggregation of the information as a convolutional layer passes by, a high-dimensional CNN could train the model while preserving more information from each channel. However, a high-dimensional CNN has a disadvantage in terms of the complexity of the model including parameter optimization. . The size of the output is determined by the filter and pooling layer size. Figure 5 shows the basic architectures of 2D-CNN and 3D-CNN models proposed in this study. Due to the high dimensional level of the 3D-CNN model, it takes much more processing time with more parameters to be optimized when compared to the 2D-CNN with similar hyper-parameters: horizontal kernel size and pooling layers. A large number of parameters causes the model to become complicated, which often results in an overfitting problem [53]. Nevertheless, the advantage of 3D-CNN is that it can keep the characteristics of each channel according to the user-defined filters, while 2D-CNN combines the information that passes on a convolutional layer. Previous studies on estimating TC intensity using satellite data have used long-wavelength infrared images at about 11 μm, which can be used to observe the cloud top pattern. The studies have typically used 2D-CNN models to analyze single-spectral TC images [17,18]. In this study, we used multi-spectral infrared images from short-wavelength at 3.7 μm to long-wavelength at 12.0 μm that were derived from a meteorological satellite sensor in a multi-dimensional CNN framework. The multi-dimensional CNN such as 3D-CNN has been effectively used for image analysis considering the three-dimensional shape of the object under investigation or the time-series of scenes [44,54,55]. The multi-layered input data were analyzed using both 2D-CNN and 3D-CNN to consider the characteristics of each channel. The CNN experiments were conducted using Tensorflow of Keras deep learning framework in Python. The proposed networks were trained in a GPU environment which provided more cost-  Figure 5 shows the basic architectures of 2D-CNN and 3D-CNN models proposed in this study. Due to the high dimensional level of the 3D-CNN model, it takes much more processing time with more parameters to be optimized when compared to the 2D-CNN with similar hyper-parameters: horizontal kernel size and pooling layers. A large number of parameters causes the model to become complicated, which often results in an overfitting problem [53]. Nevertheless, the advantage of 3D-CNN is that it can keep the characteristics of each channel according to the user-defined filters, while 2D-CNN combines the information that passes on a convolutional layer. Previous studies on estimating TC intensity using satellite data have used long-wavelength infrared images at about 11 µm, which can be used to observe the cloud top pattern. The studies have typically used 2D-CNN models to analyze single-spectral TC images [17,18]. In this study, we used multi-spectral infrared images from short-wavelength at 3.7 µm to long-wavelength at 12.0 µm that were derived from a meteorological satellite sensor in a multi-dimensional CNN framework. The multi-dimensional CNN such as 3D-CNN has been effectively used for image analysis considering the three-dimensional shape of the object under investigation or the time-series of scenes [44,54,55]. The multi-layered input data were analyzed using both 2D-CNN and 3D-CNN to consider the characteristics of each channel. The CNN experiments were conducted using Tensorflow of Keras deep learning framework in Python. The proposed networks were trained in a GPU environment which provided more cost-effective calculations than a CPU [56][57][58][59]. We used a Volta GPU (NVIDIA Tesla V100) which has 5,376 CUDA cores and 16 GB of memory. effective calculations than a CPU [56][57][58][59]. We used a Volta GPU (NVIDIA Tesla V100) which has 5,376 CUDA cores and 16 GB of memory. the major distinction between the two is that the three-dimensional kernels are applied in the 3D-CNN model. Whereas three-dimensional kernels enable the preservation of the difference between channels, the computational load is much more significant than the 2D-CNN model for training the parameters.

Optimization and Schemes
A CNN model is optimized by a set of hyper-parameters, such as the depth of convolutional layers, size and number of filters, and scale of the pooling layer. In particular, the convolutional depth and filter size of each layer is sensitive to the characteristics of the input data. The smaller the filter 3D-CNN model architecture used in this study-the major distinction between the two is that the three-dimensional kernels are applied in the 3D-CNN model. Whereas three-dimensional kernels enable the preservation of the difference between channels, the computational load is much more significant than the 2D-CNN model for training the parameters.

Optimization and Schemes
A CNN model is optimized by a set of hyper-parameters, such as the depth of convolutional layers, size and number of filters, and scale of the pooling layer. In particular, the convolutional depth and filter size of each layer is sensitive to the characteristics of the input data. The smaller the filter size, the more the model is able to catch the local characteristics of the input images. On the other hand, large filter size is suitable for getting the general pattern of the input images. Whereas a small filter extracts lots of information from the input data, it slows down the dimensional reduction, which may require training through a deeper convolutional layer [22]. Therefore, it is important to find an optimum filter size and convolutional layer depth appropriate for the characteristics of the satellite-based TC images. Several schemes were tested with hyper-parameters adjustments to find an optimum model. Table 3 shows the schemes with model architectures that were evaluated in this study. The "Control" model mimics the model proposed by Pradhan et al. [17] (i.e., CNN based TC intensity estimation algorithm with a single-spectral channel (IR1)-based TC images), which was tested with our dataset. The "Control4channels" model has the exact same architecture as the "Control" model except for the depth of the input data with multi-spectral channels (i.e., IR2, IR1, WV, and SWIR). A comparison of the two models shows the difference in model performance according to the type of input dataset. The names of the other models (i.e., IDs) consist of the CNN type and number. All the models except the "Control" model use a four-layered TC dataset. Table 3. Details of CNN model architectures. C, P and FC means a convolutional layer, pooling layer and fully connected layer, respectively. The meaning of the numbers in each abbreviation is as follows; C(# of filters)@(size of filter(horizontal size * vertical size)), P(size of pooling layer (horizontal size * vertical size)). Each convolutional layer has a ReLu activation function and all the models are optimized using Adam optimizer with β = 0.999, ε = 1 × 10 −6 .

Accuracy Assessment
Model performances were evaluated and compared using various statistical indices: mean absolute error (MAE), root mean squared error (RMSE), relative root mean squared error (rRMSE), mean error (ME), mean percentage error (MPE), and Nash-Sutcliffe model efficiency coefficient (NSE).
where n is the number of samples, and f x and a x mean the estimated and actual intensity values for each sample, respectively. Whereas RMSE and MAE were used to document the degree of absolute errors in the modeling results, ME and MPE were used as indicators of overestimation (positive) and underestimation (negative) in the models. rRMSE is calculated with RMSE divided by the average of the references. According to Li et al. [60], an rRMSE less than 30% means that the model has fair performance. NSE shows how a model fits observations. Whereas a correlation coefficient assumes that data are unbiased, NSE proposed by Nash et al. [61] can be adopted for various models including nonlinear models [62]. Moriasi et al. [63] suggested four performance levels according to NSE values: unsatisfactory (NSE≤0.5); satisfactory (0.5<NSE≤0.65); good (0.65<NSE≤0.75); and very good (0.75<NSE≤1.0). These statistical values were used to evaluate our CNN-based models. RMSE and MAE show the overall accuracy of the models, which were used to select the most optimized version of each CNN-based model. The selected models were validated using the validation dataset according to the Saffir-Simpson scale which is the disaster-potential scale of a TC proposed by Simpson et al. [64] and widely used for defining the experimental danger stage based on wind speed. Their study showed the estimation performance according to the TC development stage. Table 4 shows the overall errors in terms of five accuracy metrics by model and Figure 6 shows the categorical performance differences between "Control" and "Control4channels" on the Saffir-Simpson scale. Whereas the "Control" and "Control4channels" models showed similar performances in terms of overall accuracy metrics, they yielded a considerable different MPE. The one-channel-based model ("Control") showed a high MPE in weak TC stages and a low MPE in strong TC stages when compared to the multi-channel-based model. The IR-only model tended to overestimate the intensity of weak TCs and to underestimate that of strong TCs. On the other hand, the multi-channel-based model resulted in a relatively stable performance over all stages of TCs. This implies that the multi-infrared channel-based TC intensity estimation approach is a more reasonable method than the single-channel model. Because all the models tested in this study showed reasonable performances in terms of the ME, rRMSE, and NSE at around ±1 kts, under 25% and over 0.75, respectively, MAE and RMSE were mainly used to determine the best models. In this study, the "2d3" model resulted in the highest performance with an MAE of 6.09 kts and an RMSE of 8.32 kts. The 2D-CNNs showed slightly better performance than the 3D-CNNs when using the same hyper-parameter conditions. Due to the large number of parameters to be optimized in the 3D-CNN models, the 3D-CNN models could only produce relatively low performance when the same conditions of convolutional layer depth and epochs as the 2D-CNN models were used. Table 4. Training (through parameterization) and validation results based on the test and validation datasets, respectively, for the nine CNN-based TC intensity estimation models. The model which resulted in the best validation accuracy for each of the 2D-CNN and 3D-CNN approaches is in bold. (Index (unit): MAE (kts), RMSE (kts), rRMSE (%), MPE (%) and NSE (0-1, unitless)). The best performing 2D-CNN and 3D-CNN based models are shown in bold. The best 2D-CNN and 3D-CNN-based models showing the highest training accuracy with optimization were the 2d3 and 3d2 models, respectively. They were compared in order to evaluate the model performance using validation data. Figure 7 shows the overall validation results of the two selected models. Whereas the 3D-CNN-based model showed higher MAE and RMSE than the 2D-CNN model overall, there were different trends in the estimation results according to the The best 2D-CNN and 3D-CNN-based models showing the highest training accuracy with optimization were the 2d3 and 3d2 models, respectively. They were compared in order to evaluate the model performance using validation data. Figure 7 shows the overall validation results of the two selected models. Whereas the 3D-CNN-based model showed higher MAE and RMSE than the 2D-CNN model overall, there were different trends in the estimation results according to the development stage of the TCs. Table 5 shows the Saffir-Simpson typhoon scale-based categorical performance of the two models using ME, MPE, RMSE, and rRMSE metrics. Both models yielded the best performance with the lowest RMSE and rRMSE in Phase Five. However, the 2D-CNN model showed the largest RMSE    Figure 6.

Model
In addition, we tested our models with typhoon HATO, KHANUN, LAN, and DAMREY in 2017 in order to evaluate the general estimation performance of our models to unseen TC cases. Figure 9 shows   It is not possible to directly compare the results among different studies since different areas and data were used. Nonetheless, it is meaningful to qualitatively compare the results with similar studies. Table 6 shows the TC intensity estimation performance of the existing models from the literature and the models proposed in this study. The multi-spectral infrared image-based models, which were proposed in this study, showed comparable and even better results when compared to the existing models.  Figure 8 shows the time series estimation results of the four models (i.e., "2d3", "3d2", "Control", and "Control4channels" models) using  Figure 6.
In addition, we tested our models with typhoon HATO, KHANUN, LAN, and DAMREY in 2017 in order to evaluate the general estimation performance of our models to unseen TC cases.  , and 13.63%, respectively, which were comparable to the validation results. Similarly, the metrics of the 3d3 model for 2017 were 1.34 kts, 4.18%, 8.6 kts, 11.31 kts, 18.54%, respectively. These results confirm that the models proposed in this study successfully learned various TC convective patterns by intensity from the past data, which can be generalized to estimate the intensity for new TCs in the future.

Visualization
Deep learning methods, including CNN, are widely called "Black box" because they struggle to identify the causal relationship among the variables and the model parameters. Zeiler et al. [65] proposed an innovative visualization method called a "heat map". It is extracted based on the sum of the activation maps in the last convolutional layer [65,66]. In this paper, we resize the heat map to the size of the raw input data to intuitively interpret the images. Some TC cases of typhoon BOLAVEN in 2012 and corresponding heat maps are shown in Figure 10. Due to the difference in the kernel processes between 2D-CNN and 3D-CNN, 2D-CNN has only one heat map for the multi-channel input data, whereas 3D-CNN has more than two heat maps (i.e., corresponding to the size of input channels). In this study, we designed the 3D-CNN model to extract four heat maps to understand and interpret the effect of each layer. Whereas most high-intensity TCs showed a clear whirling pattern in the center of the TCs, the TC pattern 23/08/2012 18:00 UTC in Figure 10 looked like a cluster of clouds rather than a spiral pattern. Because most high-intensity TCs have a clear spiral pattern, this anomalous pattern might result in the models underestimating the intensity.
It is obvious that the higher the intensity of a TC, the higher the importance around the center of the TC. Weak TCs such as the 19/08/2012 06:00 UTC case in Figure 10 do not have a concentrated convection cloud center, but the heat map showed patterns of clusters with small convective clouds around the edge of TCs. Strong TCs such as the 24/08/2012 06:00 UTC case in Figure 10 typically have a whirlpool shape around the center. The areas of high values in the heat map (i.e., important part of a TC image) have a similar shape to the main pattern of the TC proposed in Dvorak technique's figures which were categorically suggested according to the T-number and TC intensity [67]. Figure 11 depicts the "3d2" model-based significant regions of each TC according to each development stage and the corresponding Dvorak's pattern. The red area indicates the most important region, which has high values in the heat map. The red regions are like the TC patterns used in the Dvorak technique-based algorithm [6,7,67]. This implies that our CNN-based model has a ability to objectively replicate the Dvorak technique.

Interpretation of Relationship between Multi-Spectral TC Images and Intensity
In this study, we proposed several 2D-/3D-CNN based TC intensity estimation models that use multi-spectral satellite images. Whereas long-wave infrared images have been mainly used to estimate TC intensity, the lower atmosphere has a significant effect in estimating TC intensity, especially for high-intensity TCs [68]. Each infrared channel does not represent the exact altitude of the atmosphere. However, thanks to the differences in the wavelengths of channels, different convective patterns can be yielded from multiple channels. Through the results of 'Control' and 'Control4channels' models in Section 4.1, it is considered that the multi-spectral image-based CNN models showed better or comparable performance compared to the existing methods when using only a single long-wavelength infrared image.
The stronger the intensity of a TC, the stronger the vortex around the TC center of the lower atmosphere. Whereas the Dvorak technique has been widely used to estimate the intensity of TCs with cloud top images (i.e., long-wavelength infrared images), actual TC intensity is significantly influenced by lower layers of the atmosphere [68][69][70]. Cha et al. [68] showed that an improved initialization method of TC vortex with consecutive cycle simulations of a dynamical model could realistically enhance TC intensity by improving the initial three-dimensional structure of TC, in particular, stronger tangential and radial winds at the lower level. Thus, multi-levels infrared images including low and mid-levels could contribute to the realistic intensity estimation in operational TC centers. The 3D-CNN-based multi-layered heat maps can reasonably represent the vertically coupled TC structure between lower and upper levels, which was also confirmed by numerical simulations with dynamical TC models. Remote Sens. 2019, 11, x FOR PEER REVIEW 18 of 26

Interpretation of Relationship between Multi-Spectral TC Images and Intensity
In this study, we proposed several 2D-/3D-CNN based TC intensity estimation models that use multi-spectral satellite images. Whereas long-wave infrared images have been mainly used to estimate TC intensity, the lower atmosphere has a significant effect in estimating TC intensity, especially for high-intensity TCs [68]. Each infrared channel does not represent the exact altitude of the atmosphere. However, thanks to the differences in the wavelengths of channels, different convective patterns can be yielded from multiple channels. Through the results of 'Control' and 'Control4channels' models in Section 4.1, it is considered that the multi-spectral image-based CNN Figure 11. Matching between the Dvorak technique-based TC patterns [57] and the CNN model-based activated regions according to TC categories. The "3d2" model-based activated region in the IR2 channel is marked in red. The 1 st column shows heat maps with the IR2 channel and the most significant region is extracted using the upper 3% of the heat map values. It is marked as the red region in the 2 nd column. The 3 rd column shows the TC patterns of each categorical TC used in the Dvorak technique.
The integrated map by distance from the center of Typhoon MUIFA using the multi-layered heat maps is shown in Figure 12. Heat map values that had the same pixel-based distance from the center were summed in the integrated heat map with the distance measurements in the x-axis. The resultant integrated map was originally a 4 × 71 sized matrix, which was then resized to a 28 × 71 sized matrix using linear interpolation for a more intuitive interpretation of the vertical trends of a TC. The integrated heat map shows the difference in the region of importance according to the distance from the center and the relative atmospheric height. Typhoon MUIFA in 2011 and Typhoon BOLAVEN in 2012 were tested when identifying the vertical structure of TCs according to their intensity. Figure 13 shows the integrated heat maps according to intensity. With weak TCs, the heat map importance was concentrated far from the center of the TCs. This confirms the typical patterns of weak TCs, which have scattered partial convective clouds [7,9,15,67]. On the other hand, strong TC based heat maps showed a tendency for the heat map importance to focus on the center of TCs in the lower layer. This also corresponds to the vertical behavior of strong TCs (i.e., central concentrated convection) [68][69][70]. These results verified that the 3D-CNN-based estimation models considered the geophysical characteristics of TCs, which in turn affected the estimation results.
using linear interpolation for a more intuitive interpretation of the vertical trends of a TC. The integrated heat map shows the difference in the region of importance according to the distance from the center and the relative atmospheric height. Typhoon MUIFA in 2011 and Typhoon BOLAVEN in 2012 were tested when identifying the vertical structure of TCs according to their intensity. Figure 13 shows the integrated heat maps according to intensity. With weak TCs, the heat map importance was concentrated far from the center of the TCs. This confirms the typical patterns of weak TCs, which have scattered partial convective clouds [7,9,15,67]. On the other hand, strong TC based heat maps showed a tendency for the heat map importance to focus on the center of TCs in the lower layer. This also corresponds to the vertical behavior of strong TCs (i.e., central concentrated convection) [68][69][70]. These results verified that the 3D-CNN-based estimation models considered the geophysical characteristics of TCs, which in turn affected the estimation results.

Novelty and Limitation
In this study, we proposed a CNN-based TC intensity estimation approach using multi-spectral satellite images. We found out that CNN-based models successfully mimicked the manual TC

Novelty and Limitation
In this study, we proposed a CNN-based TC intensity estimation approach using multi-spectral satellite images. We found out that CNN-based models successfully mimicked the manual TC intensity estimation algorithms, and the proposed multi-spectral approach made a significant contribution to TC intensity estimation. In particular, multi-channel infrared data-based TC intensity estimation models showed more stable performance when compared to the one-channel-based model, which has been widely used for estimating TC intensity up to now. It was verified that the convective cloud patterns of the middle and lower layer, as well as the upper convective cloud pattern, has considerable effects on reliable estimations of TC intensity. The significant regions of each vertical layer using heat maps were also identified using the 3D-CNN-based model.
However, there are still some limitations in the approach proposed in this research: 1) since CNNs, in particular, 3D-CNNs, require significant computational demand in terms of memory and running time, it is often difficult to fully optimize the hyper-parameters of the CNN models, and 2) it is hard to clearly understand how the models consider input data through the neural networks to produce reasonable results. Whereas various combinations of hyper-parameters were tested to identify an optimum model, it was not possible to test all available ones for our dataset. There is still a possibility that there may be a better performing model, especially one using 3D-CNN, which was not tested. Significant computational demand is one of the main problems in deep learning-based research. Whereas deep learning often has high uncertainty in the modeling process due to its dependency on training data, it can be helpful when we need to examine a huge amount of unknown information [71,72]. However, it is also difficult to identify and quantify the effects of training data on CNN models. Whereas several methods, such as heat maps and occlusion maps, have been proposed for the interpretation of CNN results, it is still not clear how CNN models recognize the pattern of input images [73][74][75][76].
The research findings from this study deserve further investigation. In future work, hyper-parameter-optimization of the 3D-CNN model using cost-effective approaches (e.g., auto-parameterization tools such as AutoKeras and Keras-tuner) should be conducted. A fully optimized 3D-CNN based model may provide a more stable and robust performance than the present model. In addition, numerical model-derived multi-dimensional TC variables can be examined in conjunction with 3D-CNN models, providing an in-depth understanding of the relationship between three-dimensional TC structure and intensity. While this study is focused on estimating TC intensity, deep learning can be adopted for forecasting TC intensity (e.g., 12, 24, and 36 h), which needs further investigation.

Conclusions
Since the widely-used TC intensity estimation method, the Dvorak technique, is a manual algorithm, there is a need to have a standardized and objective way to quantify TC intensity for end-users such as Typhoon centers and forecasters. In this research, both 2D-CNN and 3D-CNN approaches were carefully evaluated to analyze the interrelationships between multi-channel-TC images and their intensity. The 2D-CNN-based approach resulted in a very good performance with an RMSE of 8.32 kts, which is very competitive when compared to the existing approaches. Although the 3D-CNN based model yielded an RMSE of 11.34 kts, which is higher than that of the 2D-CNN based model, its performance was still comparable with the existing approaches. For the hindcast validation, both 2D-and 3D-CNN models produced similar results (i.e., RMSE of 8.31 and 11.31 kts, respectively), which proved the robustness of the proposed models. Our TC intensity estimation model based on multi-spectral channels showed better performance (~35.9%) when compared to the existing approach with a single-spectral channel [17] based on the same datasets.
Furthermore, how the 3D-CNN model regressed the TC intensity using satellite-based multiple infrared images was examined using heat maps. Through this experiment, it was found that the pattern of the inner core part of a TC was closely related to TC intensity. In particular, the proposed