A deep learning framework for analyzing cloud characteristics of aggregated convection using cloud-resolving model simulations

This study introduces a framework to extract the high-dimensional nonlinear relationships among state variables for aggregated convection. The prototype of such a framework is developed that applies the convolutional neural network models (CNN models) to retrieve the cloud characteristics from cloud-resolving model (CRM) simulations. CNN model prediction factors are hidden in the high dimensional weighted parameters in each neural network layer. Therefore, we can dig out relevant physics processes by iterating the CNN models' training process and eliminating the features with the physics explanation we can provide at a given stage. Within a few iterations, explainable non-linear relationships among variables can be provided. We identified that the averaged cloud water path (CWP), the maximum value of CWP in each cloud, and the cloud coverage rate are essential for identifying aggregation. Furthermore, by analyzing the encoded channels of the CNN model, we found a strong relationship between aggregation, cloud peripherals, and fractal dimensions. The results suggest that the important nonlinear cloud characteristics for identifying the aggregation can be captured with the proper adjustment and limitation of the input data to the CNN models. Our framework provides a possibility that we can explore the high dimensional relationship between the physics process with the assistance of the CNN model.


| INTRODUCTION
Tropical deep convection can aggregate into organized systems that can reach mesoscale or even hundred-kilometer scale horizontally. These aggregated convection systems play an important role in the climate system and the regional extreme weather. Studies based on satellite observations have suggested that the meso-scale convective systems (MCSs) contribute to over half of the total precipitation over the tropics (Houze & Yuan, 2010), and large convective systems are associated with extreme precipitation events (Hamada et al., 2014). The interactions between organized convective systems and their environment modulate the moisture, cloudiness, radiation, and precipitation across a wide range of spatial and temporal scales, leading to changes in large-scale circulation.
The "moist-convective quasi equilibrium (QE)" hypothesis (Arakawa, 2004;Arakawa & Schubert, 1974) is a fundamental concept regarding the statistical relationship between the ensemble average of convection and the large-scale state of the atmosphere. Convection of various degrees of organization can exist when the ensemble effect (adjustment) is achieved (Xu et al., 1992). The QE concept has been frequently applied to numerical studies using the cloud-resolving models (CRMs), in which convective clouds are explicitly resolved with high horizontal resolutions. For example, Tsai and Wu (2017) obtained the key statistics of aggregated convection corresponding to various large-scale environments. They identified the aggregated convection by the distinct large-size mode in the distribution of the cloud system size. Their results suggest that aggregated convection would develop if the environment exhibits column relative humidity (CRH) exceeding 80% (67%) in the nonshear (shear) condition, and the large-size mode of aggregated convection becomes more distinct with increasing column relative humidity. Such processes may be related to the enhanced probability of multicellular cloud structure under the moister environment.
While aggregated convection under different environments can be distinguished more robustly in CRM simulations by giving abundant data volume and complete state variables, finding the analogy in nature from observational data sets remains challenging. Part of the reason is the limited and uneven sampling of observation in space, time, and state variables. For example, although images from geostationary satellites can provide the spatially most extensive and the temporally most continuous coverage of convective systems among all forms of observations, they only capture the nadir view of the twodimensional (2D) distribution of cloud coverage, cloudtop brightness temperature, and liquid water path, and so on. The vertical structural changes during convective organization can only be inferred from these 2D fields. Such limitations also lead to the fact that the metrics/ indices to define aggregation by observations are mostly simple and linear, such as the number of systems within a given domain, the sum of interobject distance among the systems, or the sum of the object radii. As shown in the intercomparison study by Xu et al. (2019), these indices have different sensitivities to the size of the object group and the selected variables, which lead to ambiguous or sometimes even contradictory identification of the aggregation state.
As aggregated convection is a high-dimensional phenomenon involving the nonlinear changes in the structures of convective systems, a metric to describe the aggregated convection in nature needs to involve the nonlinear relationships of multiple physical characteristics that can be sufficiently and readily observed. Our work builds a high-dimensional nonlinear statistical model that learns from the snapshot of aggregated convection in CRM simulations for the purpose of designing the appropriate definition for the observed evidence of aggregated convection snapshots. To explore the nonlinear relationships between variables, we can use a Convolutional Neural Network (CNN) that can statistically capture the characteristics of the gridded data of physical variables (Chattopadhyay et al., 2020;Higa et al., 2021;Tsou et al., 2019). Though building a predictive model with CNN can be done automatically and data-driven, some of the captured variables may contribute the variance mainly in a subset of the data while contributing only minor variance in the remaining data. Thus, the nonlinear relationship among the retrieved variables will lead to difficulty in explaining and connecting the variable to the physics process. In addition, it is desirable that the model can provide three core elements, namely, transparency, interpretability, and explainability (Roscher et al., 2020). With these three core elements, scientific insight drawn from the deep learning model will be further augmented in addition to its high accuracy (Ribeiro et al., 2016).
Therefore, in this study, we will construct an "explainable framework" to connect the CNN retrieved variables to the physics process of aggregated convection with the data from the CRM simulations. This can provide an important foundation for understanding the aggregation processes in the real world. The manuscript is organized as follows. The structure of the CNN model in this study is shown in Section 2.1. Details of the datasets are described in Section 2.2, and the iterations of the input dataset are in Section 2.3. Section 2.4 summarizes the selected data distribution and the performance of the CNN model in this study, and the relationship between the retrieved variables and the physics process will be discussed in Section 3. The summary is provided in Section 4.

| CNN model and the hidden layer
The convolutional neural network (CNN) is a robust model for recognizing the image or the distribution with spatial characteristics. The structure of the CNN applied in this study is visualized in Figure 1. This CNN model is inspired by the VGG-net from the Visual Geometric Group (VGG) (Simonyan & Zisserman, 2015).
This typical network can be applied to image classification with multiple 3 Â 3 convolutional kernel layers followed by the max-pooling layers. This convolutional structure can reduce the trainable parameters of the network while preserving the spatial distribution's characteristics. The activation functions used in this model are rectified linear units (ReLUs), except the output layer is a sigmoid function. The sigmoid function in the output layer is purposed to limit the output to a numerical value between 0 and 1, which represents the possibility of a case coming from the aggregated simulations.
Although the input data may change in each iteration, as described in Section 2.3, the structure of the CNN model remains the same among the iterations. Also, we prefer to construct a CNN model with only a few layers; thus, the model can capture the most important characteristic in each iteration. Though a CNN model with more layers (i.e., more trainable parameters) may increase accuracy, the physical interpretation will become more difficult since the signals are handled by the other layers, hampering the interpretation of the first layer by our knowledge.
Since the logic of diagnosing the aggregation hides in the connection between layers of CNN, a common way to understand the logic is to visualize the signal outputted from the hidden layer of CNN and connect it with the physics meaning we are familiar with. This study will visualize the first max-pooling layer right after the first convolution layer. Before any other convolution process, this layer will preserve more spatial information and may be helpful for us to find out the connection of the physics parameters.

| Dataset
In order to get the data that covered aggregated states, this study uses a set of experiment outputs from the simulations of Tsai and Wu (2017) with a vector vorticity equation cloud-resolving model (VVM) (Jung & Arakawa, 2008;Wu et al., 2019). The initial profiles and the horizontal uniform large-scale forcing follow the setting derived from the GATE phase III. The imposed large-scale forcing is prescribed as constant in time. The large-scale moistening effect is controlled by multiplying a scale from +6 to À6 with an interval of 1.5, representing a very moist to a very dry environment. Thus there are nine different degrees of environmentally moistening in the dataset. Tsai and Wu (2017) showed that a distinct mode in the cloud size distribution could identify the aggregated state. Following their results, we select the first and second moistest simulations as the aggregation cases, and the seventh and eighth moistest simulations are regarded as the nonaggregation cases.
Aggregated convection describes the state that convections are organized into systems reaching mesoscale or even larger horizontal scales. Such upscale processes can be better visualized as the connecting components of the cloud water field. Our framework is designed as a procedure to help explain the critical properties that are used in the CNN model to do the classification, in which the cloud water fields are regarded as useful characteristics to iterate over. Therefore, we choose specific statistical properties of the cloud condensate path in each column (CW P i,j ), including the CWP avg , the CWP max , and the Cloudy Area Coverage Rate (the masked CWP) as our input. The CW P i,j is calculated by where qc þ qi ð Þ i,j,k represents the mixing ratio of the cloud condensate (kgÁkg À1 ) in each grid point. ρ k represents the density in the specific height, and Δz k represents the grid height in the stretching grid. The whole domain in a single snapshot is 512 Â 512 Â 18 km 3 , and the horizontal resolution is 1 Â 1 km 2 . Snapshots of the aggregation and nonaggregation cases of the whole simulation domain with the time evolution of the CWP are shown in Figure 2.
To focus on the characteristics of aggregated convection and its environment rather than the characteristics between them, we selected the 128 Â 128 km 2 cloud condensate path crops over all the times and domains. This also increases the F I G U R E 2 (Non agg-1) to (non agg-4) shows the examples of the nonaggregated case, and (Agg-1) to (Agg-4) shows the aggregated case with the cloud water path values in (kgÁm À2 ). Each subgraph shows the simulations' 512 Â 512 (km) domain. The solid red lines represent the area that may be difficult to classify visually. Those areas can be the clouds before elimination in the nonaggregation cases, and the clouds are not at the core of the convection in the aggregation cases training and validation dataset, and the collection of the crops will be used as the input data for the CNN model.

| Framework and the modified dataset
CNN model with more trainable parameters and a limited dataset without any constraint may cause the overfitting problem. The overfitting will lead to a trained model without the generality and may have low performance on the dataset other than the training set (the validation set). Our framework is designed with a fixed number of trainable parameters and then iteratively modifies the training dataset. The modification depends on the leading characteristic that the model may rely on the prediction in the previous iteration. We then remove the leading characteristics and re-train the model again to see if the model can predict F I G U R E 4 (a) and (b) show the example of the comparison between nonaggregation and aggregation labeled data in the training dataset. The cropped cloud water path field on a 128Â 128 km 2 is masked by a 2 kgÁm À2 threshold. The green grid will be regarded as one, while dark blue grids will be regarded as zero. The ones and zeros will be inputted into the input layer of the CNN model. The graph in (c) shows the referenced cloud water (masked) inputted into the CNN models. From (d-0) to (d-4), show the graphs of the encoded channel 0 to 4, representing the max-pooling layer with the 42 Â 42 grid or the 128 Â 128 km domain. Each grid concludes the signal from the 3 Â 3 km 2 box. The brightness in the graphs shows the strength of the signal outputted. Each channel has its scale, so we will only focus on the relative strengths and the spatial distributions. The channel order may vary due to the randomness of the training process; however, the characteristics of the relative strengths and the spatial distributions will remain cases from the remaining dataset in the next iteration. A schematic of the framework is provided in Figure 1.
In the first round of iterating the framework, the trained model based on the complete CWP information shown in Figure 2 has an accuracy of about 98% in the validation dataset. By plotting the predicted possibility and the CWP averaged horizontally over the whole domain (128 Â 128 grids, or 128 Â 128 km 2 ) in each snapshot in Figure 3a, the cases with averaged CWP greater than 1 kgÁm À2 are easily classified by the model because there is almost no case from nonaggregation which averaged CWP greater than 1 kgÁm À2 . Thus we decided to remove these cases from the dataset and focus on the properties obtained from the remaining dataset.
The trained model built from the dataset without the information of the large average CWP value has an accuracy of about 95% in the validation set, implying that there are other characteristics captured by the CNN model to classify the cases. For the second round of iteration, the extremum of the CWP values may be a potential characteristic to classify the cases because the aggregated cases tend to have large values of cloud water owing to the higher degree of convection organization. Since Figure 3b shows that the maximum CWP in most cases is greater than 2 kgÁm À2 , we masked the CWP value information with this threshold while preserving the cloud shape information. The grids with a CWP value greater than 2 kgÁm À2 will be set to one, otherwise will be set to zero. This will diminish the distribution information of the CWP while keeping the shape of the cloud.
In the third round of iterating the framework, after removing the characteristic of averaged value and maximum CWP, the CNN model persists with 90% accuracy in the validation set. Since the distribution of the CWP is masked, we think the coverage rate of the CWP may be an important property for the CNN model to predict. As in Figure 3c, the cases with low cloud coverage rates tend to be classified as nonaggregated. However, some cases between the cloud fraction 5% and 10% are not classified well. Thus, we decided to limit the selected case coverage rate between 5% and 10% to see if the CNN model could predict the aggregation with such a dataset.
After this iteration, the modified dataset from the nonaggregated and aggregated cases may be difficult to recognize visually, as shown in Figure 4a,b. The selected nonaggregation cases are almost in the first few hours of simulation time before the environment becomes too dry to have no clouds, which has small CWP values and may be trivial to the CNN model to do the classification. This dataset is similar to the images we may obtain from the satellites of liquid water/ice channels, which do not directly observe the center of the aggregations, as shown in the red solid frames in Figure 2.
The iterating process in our framework demonstrates that we can identify important characteristics for the CNN model to classify the aggregation cases, including the averaged CWP, maximum CWP, and cloud coverage rate. Nevertheless, the accuracy of the CNN model built by this dataset is 73% on the validation, which suggests that the prediction of the CNN model is greater than the expected value of random guesses (50%). Thus, we can further investigate the hidden relationship that can be important for aggregation between variables in the next section.

| DISCUSSION
Though we cannot conclude which statistical variables of CWP the CNN model used to do the classification directly, it will be more informative for us to understand the model's behavior by exploring the characteristics captured by the CNN model with the analysis of the hidden layer.
T A B L E 1 R 2 -squared values of the cloud fraction/fractal dimension and the channel signal sums corresponded to the model used in Figure 4 (d-0) to (d-4), separated by the aggregated and nonaggregated cases The maximum values in each column are represented in bold. In the group of the cloud fractions, the max R 2 value is in channel 2, and the nonaggregation cases have a greater linear relationship. In the group of fractal dimensions, the R 2 values greater than 0.5 are shown in the aggregated case in channels 0, 1, 3, and channel 4, and the aggregation cases have greater linear relationships.
To gain insights into the characteristics that the CNN model may be used to determine the aggregation cases, we thus visualize the signal from the max-pooling layer. The example of the visualized signal in each channel shows in Figure 4 from (d-0) to (d-4), and the reference input is in Figure 4c. Strong signals can be seen at the center of the clouds in channel 2, while signals show at the edges in channels 0, 1, 3, and 4. By concluding these characteristics, the total coverage of the CWP (the cloud fraction) and the fractal dimension may be a nice perspective to understand the feature in the convolution layers.
Though the CNN model may rely on logic that could be nonlinear, a possible depiction of the logic of classification of the aggregated cases can be finding the linear relationship between the physics process that we are familiar with. Table 1 also shows the fractal dimensions have linear relationships with the signal sum from the aggregation cases in channels 0, 1, 3, and 4 with R 2 greater than the nonaggregation cases. The difference in linearities between aggregation and nonaggregation cases indicates that the fractal dimension, which represents the nonlinearity in the geometry of the cloud structure, can be a potential characteristic we may use to diagnose the cases.
These results suggest that the CNN models construct a delicate logic to capture the characteristics that may include the cloud coverage and the fractal dimensions for identifying the aggregation cases. This may be difficult to explain linearly with the physical variables and have different logic for detecting the aggregation and nonaggregation. In the conclusion of the analysis above, to separate the aggregated cases, the importance of the fractal dimensions of the cloud is more significant in the aggregated cases. In contrast, the importance of cloud coverage is more significant in nonaggregated cases.

| SUMMARY
We introduce a data-driven framework for retrieving the essential cloud characteristics based on the convolutional neural network models and cloud-resolving model simulations. Though the CNN models usually perform well in handling the dataset with the spatial pattern, the prediction factors hide in the high dimensional weighted parameters in the neural network layers. Our framework suggests good practice for extracting the explainable features used in the CNN model by iterating the training of CNN models and removing the features from the dataset that can be explained by our background knowledge to dig out subtle structures of interest through this framework. Based on the framework, we identified that the averaged cloud water path (CWP), the maximum value of CWP in each cloud, and the cloud coverage rate are essential for identifying aggregation. Furthermore, we found a strong relationship between the aggregation, cloud peripherals, and fractal dimensions by analyzing the CNN model's encoded channels.
Since this framework can be adapted to the dataset with 2D fields, the observations, such as the dataset from the satellite, could be a potential challenge in our future work. The data distribution from the observation may differ from the simulation from CRM, and it will be a good examination of the robustness of this framework.