Broad-UNet: Multi-scale feature learning for nowcasting tasks

Weather nowcasting consists of predicting meteorological components in the short term at high spatial resolutions. Due to its influence on many human activities, accurate nowcasting has recently gained plenty of attention. In this paper, we treat the nowcasting problem as an image-to-image translation problem using satellite imagery. We introduce Broad-UNet, a novel architecture based on the core UNet model, to efficiently address this problem. In particular, the proposed Broad-UNet is equipped with asymmetric parallel convolutions as well as an Atrous Spatial Pyramid Pooling (ASPP) module. In this way, the Broad-UNet model learns more complex patterns by combining multi-scale features while using fewer parameters than the core UNet model. The proposed model is applied to two different nowcasting tasks, i.e. precipitation maps and cloud cover nowcasting. The obtained numerical results show that the introduced Broad-UNet model makes more accurate predictions than the other examined architectures.


Introduction
Weather forecasting is an essential task that has a great influence on people's daily lives and activities. Industries such as agriculture [1], mining [2] and construction [3] rely on weather forecasts to make decisions, and thus unexpected climatological events may result in large economic losses. Similarly, accurate weather forecasts improve safety on flights and roads and help us foresee potential natural disasters.
Due to its importance, precipitation nowcasting is becoming an increasingly popular research topic. This term refers to the problem of forecasting precipitation in the near future at high spatial resolutions. It is usually performed using satellite imagery, and many different approaches have been proposed for this problem. Classical nowcasting approaches mainly focus on two methods: Numerical Weather Prediction (NWP) [4] and extrapolation-based techniques, such as Optical Flow (OF) [5]. NWP methods simulate the underlying physics of the atmosphere and ocean to generate predictions, so they require a vast amount of computational resources. In contrast, optical flow based methods identify and predict how objects move through a sequence of images, but they are unable to represent the dynamics behind them. In recent years, the massive amount of available data has aroused research interest in data-driven machine learning techniques for nowcasting [6,7,8]. By taking advantage of available historical data, data-driven approaches have shown better performance than classical ones in many forecasting tasks [9]. Furthermore, while classical machine learning techniques rely on handcrafted features and domain knowledge, deep learning techniques automate the extraction of those features. Recent advances in deep learning have shown promising results in diverse research areas such as neuroscience, biomedical signal analysis, weather forecasting and dynamical systems, among others [10,11,12,13,14,15,16,17,18]. Convolutional Neural Networks (CNNs) are the most popular algorithms used in computer vision [19], achieving state-of-the-art results in various tasks [20,19,21]. CNN architectures, such as AlexNet [22], ResNet [23] and InceptionNet [24], to name a few, mainly consist of combinations of convolutional and pooling layers. They excel at classification, identification and recognition tasks.
Among other architectures, autoencoders have emerged as one of the most powerful approaches in both supervised [25,26,27] and unsupervised learning [28,29,30], with the UNet [31] being one of the most versatile architectures. The UNet architecture was first proposed for medical image segmentation, but it has been employed in various domains [32,33,27]. It consists of a contracting path, to extract features, and an expanding path, to reconstruct a segmented image, with a set of residual connections between them to enable precise localization. In our previous work [27], we introduced various extended versions of the UNet for weather forecasting problems. In this paper, we further extend the best performing model in that work [27], i.e. the AsymmIncepRes3DDR-UNet. In particular, motivated by the results of [34], we augment the AsymmIncepRes3DDR-UNet's feature extraction capacity by incorporating an Atrous Spatial Pyramid Pooling (ASPP) module [34] in the bottleneck of the network. The ASPP module works in line with the existing building blocks of our network (multi-scale feature convolutional blocks), extracting multi-scale features in parallel and combining them. Therefore, unlike the original UNet, the proposed model is designed to capture multi-scale information. In addition, it keeps the temporal dimension unchanged along the encoder path and then reduces it before the encoder features are concatenated with the output of every level in the decoder path. As a result, it can efficiently learn a mapping between 3-dimensional input data and 2-dimensional output data. Furthermore, we apply a kernel factorization in most of the convolutional operations of the model, resulting in a significant reduction in the total number of parameters compared to the original UNet while improving performance. These techniques are explained in detail in the subsequent sections. We further present an analysis of this multi-scale feature extraction and the enhancement provided by the ASPP module.
We show the versatility of the proposed model by applying it to two different nowcasting tasks, i.e. precipitation nowcasting and cloud cover nowcasting. In the precipitation nowcasting task, the model performs a regression of every pixel. In the case of cloud cover nowcasting, the model classifies each pixel as containing clouds or not. In addition, we directly compare the proposed model with the model introduced in [32], a variation of the UNet architecture that relies on depthwise-separable convolutions and includes a CBAM attention module [35] at each level. While the model in [32] approximates the performance of the original UNet with a significantly reduced number of parameters, our model outperforms the original UNet with a reduced number of parameters.

Related work
Traditionally, optical flow based models have been the most popular techniques among classical methods for precipitation nowcasting tasks [36,37]. However, machine learning and deep learning based approaches have come to dominate this field of research in recent years. Due to the vast amount of available satellite imagery, powerful deep neural network based models are suitable candidates for addressing various problems in this field. In particular, CNN based architectures have shown their great ability to handle 2D and 3D images. Thanks to the versatility of CNNs, nowcasting problems can be tackled in different fashions. For instance, the authors in [38] and [39] treated the multiple time-steps as multiple channels in the network. In this way, they could apply a simple 2D-CNN to perform the predictions. Additionally, the authors in [40] treated the multiple time-steps as depth in the samples. Thus they could apply a 3D-CNN and approximate more complex functions. As has been shown in [41,39,32], among the CNN architectures used, UNet is more suitable for this task, due to its autoencoder-like architecture and its ability to tackle image-to-image translation problems.
In addition to CNNs, Recurrent Neural Networks (RNNs) have proved to be a robust approach. These architectures struggle to work with images, but they can capture long-range dependencies, an ability that CNNs can only partially achieve with the addition of attention mechanisms, such as self-attention [42]. In [6], the authors introduce an architecture that combines the strengths of both CNNs and RNNs. They extend the fully connected LSTM (FC-LSTM) with convolutional structures, obtaining the Convolutional LSTM network (ConvLSTM). As a result, the proposed model captures spatiotemporal correlations better than the FC-LSTM model. The authors in [40] introduce the Trajectory GRU (TrajGRU) model as an extension of the ConvLSTM. This architecture keeps the advantages of the previous model and also learns the location-variant structure of the recurrent connections, outperforming the other models it was compared with. Nevertheless, these RNN models have not been directly compared with the UNet in nowcasting tasks. The authors in [26] compare different types of models for cloud cover nowcasting. In [26], the models under assessment are various versions of CNNs, RNNs, LSTM and UNet. The authors showed that the UNet model is the best performing model for the given cloud cover nowcasting task.

Proposed model
In this section, we introduce our Broad-UNet model. First, different elements that are used for building the network are presented. The complete architecture is then explained.

Multi-scale feature convolutional block
Motivated by the goal of extracting features at different scales, the model contains a block consisting of parallel arms as shown in Fig 1. This block serves as the core building block of our network. Within this block, the data forks into parallel branches of convolutions with different kernel sizes, after going through an initial convolution. A 3 × 3 × 3 convolution is followed by a set of parallel convolutions with 1 × 1 × 1, 3 × 3 × 3 and 5 × 5 × 5 kernel sizes. The outputs of the different branches are then concatenated and merged with a 1 × 1 × 1 convolution. Additionally, inspired by the results found in [43], we keep some information intact alongside the parallel branches with a residual connection. Lastly, the output of the block is rectified with a ReLU activation function. To reduce the large number of features resulting from these branches, we factorize the convolutions as suggested in [44]. That means a convolution N × N × N decomposes into the three consecutive 1 × 1 × N, 1 × N × 1 and N × 1 × 1 convolutions. Hence, this sequence is an approximation of the original convolution with fewer parameters.
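The parameter saving obtained from this factorization can be checked with a short calculation. The following is a minimal sketch, not the authors' implementation: the channel count of 64 is a hypothetical value chosen only for illustration, and it assumes that each of the three factorized convolutions has a bias term and produces the same number of output channels.

```python
def conv3d_params(kernel, in_channels, out_channels, bias=True):
    """Learnable parameters of a single 3D convolution layer."""
    k_t, k_h, k_w = kernel
    weights = k_t * k_h * k_w * in_channels * out_channels
    return weights + (out_channels if bias else 0)


def factorized_conv3d_params(n, in_channels, out_channels, bias=True):
    """Parameters of consecutive 1x1xN, 1xNx1 and Nx1x1 convolutions."""
    first = conv3d_params((1, 1, n), in_channels, out_channels, bias)
    second = conv3d_params((1, n, 1), out_channels, out_channels, bias)
    third = conv3d_params((n, 1, 1), out_channels, out_channels, bias)
    return first + second + third


CHANNELS = 64  # hypothetical channel count, chosen only for illustration

for n in (3, 5):
    full = conv3d_params((n, n, n), CHANNELS, CHANNELS)
    factorized = factorized_conv3d_params(n, CHANNELS, CHANNELS)
    print(f"N={n}: full={full}, factorized={factorized}")
```

Under these assumptions the weight count of the full kernel grows with N³, while the factorized sequence grows with 3N, so the saving is larger for the 5 × 5 × 5 branch than for the 3 × 3 × 3 branch.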

Atrous Spatial Pyramid Pooling (ASPP)
Atrous Spatial Pyramid Pooling (ASPP) is a mechanism used to capture multi-scale information. It consists of parallel branches of convolutions, similar to the convolutional block presented above. However, instead of using different kernel sizes, the same kernel size is used with an increasing dilation rate (6, 12 and 18). In this kind of convolution, the filter is upsampled by inserting zeros between successive values. As a result, the filters cover a larger field of view without an explosion in the number of parameters. Further, the module only extracts information in the spatial dimensions by applying a 2-dimensional filter (shape 1 × N × N). In addition, ASPP incorporates one branch that extracts image-level features, which allows it to capture global context information. Here, we implement this branch by applying global average pooling, followed by reshaping and upsampling back. The extracted features are then concatenated and combined with a 1 × 1 × 1 convolution. The scheme of this mechanism is shown in Fig. 2.
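To make the effect of the dilation rate concrete, the sketch below computes the spatial extent covered by a dilated kernel along one axis and shows the zero insertion on a one-dimensional kernel. The kernel size of 3 is an assumption made for illustration (the text only states that the same kernel size N is used in every branch), and the helper names are hypothetical.

```python
def effective_kernel_size(kernel_size, dilation_rate):
    """Extent covered along one axis by a dilated (atrous) kernel."""
    return kernel_size + (kernel_size - 1) * (dilation_rate - 1)


def dilate_kernel_1d(weights, dilation_rate):
    """Insert (dilation_rate - 1) zeros between successive kernel values."""
    dilated = []
    for index, weight in enumerate(weights):
        dilated.append(weight)
        if index < len(weights) - 1:
            dilated.extend([0.0] * (dilation_rate - 1))
    return dilated


KERNEL_SIZE = 3  # assumed kernel size, for illustration only

for rate in (6, 12, 18):
    extent = effective_kernel_size(KERNEL_SIZE, rate)
    print(f"dilation {rate}: covers {extent} pixels with {KERNEL_SIZE} weights per axis")

# Small example of the zero insertion itself.
print(dilate_kernel_1d([1.0, 2.0, 3.0], 2))
```

The number of learnable weights stays the same for every dilation rate; only the spacing between them changes.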

Broad-UNet
Thanks to the effectiveness of the UNet architecture in solving image-to-image mapping tasks, it is chosen as the basis of our model. The core UNet model, which was originally proposed for medical image segmentation tasks, adopts an autoencoder structure. While the encoder part extracts features from the input image, the decoder part performs classification on each pixel to reconstruct the segmented output. In addition, a set of residual connections between both parts allows precise localization in the output image. In contrast, our proposed Broad-UNet manipulates 3-dimensional data in the encoder and 2-dimensional data in the decoder. Thus, it takes several time-steps as input in the first dimension and outputs only one time-step in the same dimension. Multi-scale feature convolutional blocks are alternated with pooling operations in the encoder, resulting in five levels. The pooling is only performed in the spatial dimensions (2nd and 3rd) and is implemented with a Max Pooling layer. In this way, the temporal dimension of the data remains unchanged. The decoder follows a similar structure: it alternates multi-scale feature convolutional blocks and upsampling in the spatial dimensions. Additionally, we incorporate extra convolutions in the connections between the different levels of the encoder and decoder. These intermediate convolutional operations aim to reduce the temporal dimension from T time-steps to 1.
To extend the multi-scale feature learning process, we combine the convolutional blocks with the ASPP module. It is placed in the bottleneck of the network, where the data has a highly abstract representation. In this way, we allow the network to capture more information from this representation without using larger kernels and more computational resources. Also, dropout is included in the bottleneck to force the network to learn a more sparse representation of the data and avoid possible overfitting. As a result, the network input is of shape T × H × W × F and output is of shape 1 × H × W × F, where T is the number of time-steps (lags), H and W are the height and width of the images, and F is the number of features or elements, which we consider as channels in our network. Here, the convolutions to reduce the temporal dimension have a kernel size 1 × T × T with valid padding. In addition, the use of asymmetric convolutions drastically reduces the total number of parameters of the network. While the number of parameters is ∼28 million using regular kernels N × N × N, the number of parameters after factorizing the convolutions into 1 × 1 × N, 1 × N × 1 and N × 1 × 1 is ∼11 million. The complete architecture of the model can be found in Fig. 3. Furthermore, a comparison in the number of learnable parameters among different UNet based models examined in this paper is shown in Table 1.
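The way the tensor shapes evolve through the encoder can be traced with a few lines of code. This is an illustrative sketch under assumptions that the text does not state explicitly: a pooling factor of 2 in each spatial dimension and four pooling operations between the five encoder levels. The function names are hypothetical.

```python
def encoder_shapes(t, h, w, levels=5, pool=2):
    """(T, H, W) at each encoder level when pooling acts only on H and W."""
    shapes = [(t, h, w)]
    for _ in range(levels - 1):
        h, w = h // pool, w // pool  # spatial pooling; T is left untouched
        shapes.append((t, h, w))
    return shapes


def skip_shape(shape):
    """Shape after the connection convolution reduces T time-steps to 1."""
    _, h, w = shape
    return (1, h, w)


# Example with the input size used for the precipitation maps (12 lags, 288 x 288).
for level, shape in enumerate(encoder_shapes(12, 288, 288), start=1):
    print(f"level {level}: encoder {shape} -> decoder side {skip_shape(shape)}")
```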

Data description and preprocessing
To assess the performance of our model, we apply it to two different datasets. Both of them consist of satellite images and are intended to tackle weather nowcasting problems. The first one includes precipitation maps, in which the value of each pixel shows the amount of rainfall in that region. The second one consists of cloud cover maps, in which the pixel values are binary and indicate whether there is a cloud or not in that region. Here, we recreate the same samples as in [32] for the first dataset, and as in [26] for the second dataset. In this way, we can make a fair comparison with the results obtained in those research works. For reproducibility purposes, all our models and scripts are available on GitHub¹. Also, the datasets and pre-trained models are available upon request.

Precipitation maps dataset
The first dataset, provided by the Royal Netherlands Meteorological Institute (Koninklijk Nederlands Meteorologisch Instituut, KNMI) [45], includes rainfall measurements from two Dutch radar stations (De Bilt and Den Helder). These measurements take the form of images. The images cover the region of the Netherlands and neighbouring countries, spanning four years at 5-minute intervals. To train and validate the models, we use data from the years 2016-2018 (80% train / 20% validation), and the data from 2019 is used as the test set.
¹ https://github.com/jesusgf96/Broad-UNet
The value of each pixel represents the accumulated amount of rainfall in the last five minutes. That means that a value n represents n × 10^−2 mm of rain per square kilometre. The resolution of the images is 765 × 700, and the measured region is circle-shaped with a large margin. Following the lines of [32], we cropped the central square area of size 288 × 288, as shown in Fig. 4.
Moreover, there is a high imbalance between pixels with rain and no rain, with plenty of images lacking rainy pixels. Therefore, as in [32], we filter the dataset, keeping only the images in which at least 50% of the pixels contain any amount of rain. This dataset is then used to create the training/validation/test samples. Additionally, we create a second dataset by keeping the images in which at least 20% of the pixels contain any amount of rain. From this second dataset, we use only the test set. It therefore serves as a way of testing our trained models under different conditions. We also normalize both datasets by dividing them by the highest value in the training set.
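The filtering and normalization steps described above can be sketched in a few lines. This is a simplified illustration on tiny 2 × 2 toy images, not the preprocessing script of the paper: the function names are hypothetical, a pixel is assumed to contain rain when its value is greater than zero, and the same toy images are reused for both thresholds only to show the calls.

```python
def rain_fraction(image):
    """Fraction of pixels containing any amount of rain (value > 0)."""
    pixels = [value for row in image for value in row]
    rainy = sum(1 for value in pixels if value > 0)
    return rainy / len(pixels)


def filter_by_rain(images, min_fraction):
    """Keep images in which at least `min_fraction` of the pixels contain rain."""
    return [image for image in images if rain_fraction(image) >= min_fraction]


def max_pixel(images):
    """Highest pixel value found in a collection of images."""
    return max(value for image in images for row in image for value in row)


def normalize(images, max_value):
    """Divide every pixel by the highest value of the training set."""
    return [[[value / max_value for value in row] for row in image] for image in images]


# Toy data: three 2 x 2 "precipitation maps" with 25%, 50% and 75% rainy pixels.
images = [
    [[0, 0], [0, 5]],
    [[10, 0], [20, 0]],
    [[1, 2], [3, 0]],
]

train_50 = filter_by_rain(images, 0.5)  # stands in for the 50% training data
test_20 = filter_by_rain(images, 0.2)   # stands in for the 20% test data
scale = max_pixel(train_50)             # highest value in the training set
train_50_norm = normalize(train_50, scale)
test_20_norm = normalize(test_20, scale)
print(len(train_50), len(test_20), scale)
```

Note that the test images are divided by the training maximum, so the same scale is applied to every split.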

Cloud cover dataset
The second dataset is the "Geostationary Nowcasting Cloud Type" classification product [46] from the European Organisa- […]
For multi-comparison purposes, we generate two different datasets from this data. In the first dataset, we follow the lines of [26] and use data from 2017 and the first semester of 2018 as the training set. The data from the second semester of 2018 is then used for both validation and test. In the second dataset, as in [32], we use data from 2017 and the first semester of 2018 to train and validate our models (80% train / 20% validation), and the data from the second semester of 2018 is used only for test. In this data, every pixel can have 15 different values (1: Cloud-free land, 2: Cloud-free sea, 3: Snow over land, 4: Sea ice, 5: Very low clouds, 6: Low clouds, 7: Mid-level clouds, 8: High opaque clouds, 9: Very high opaque clouds, 10: Fractional clouds, 11: High semitransparent thin clouds, 12: High semitransparent meanly thick clouds, 13: High semitransparent thick clouds, 14: High semitransparent above low or medium clouds, 15: High semitransparent above snow/ice). However, following the lines of [26], we aim to perform a classification between cloud and no-cloud. Therefore, we group the labels from 1 to 4 into 0 (no-cloud) and the labels from 5 to 15 into 1 (cloud). Also, we crop the images according to the boundaries of France: [51.896, 41.104, -5.842, 9.842] (upper latitude, lower latitude, left longitude, right longitude). Then we apply a transformation to obtain a suitable projection and reshape the resulting image to 256 × 256 pixels. Fig. 9 displays an example of the described pre-processing steps.
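The grouping of the 15 cloud-type labels into a binary mask can be written as a small mapping. The sketch below is illustrative only; the function and constant names are hypothetical, and raising an error for values outside 1 to 15 is a design choice of this example rather than something described in the text.

```python
NO_CLOUD_LABELS = frozenset({1, 2, 3, 4})  # cloud-free land/sea, snow, sea ice
CLOUD_LABELS = frozenset(range(5, 16))     # labels 5 to 15, all cloud types


def to_binary_mask(label_map):
    """Map cloud-type labels to 0 (no-cloud) or 1 (cloud), pixel by pixel."""
    mask = []
    for row in label_map:
        new_row = []
        for label in row:
            if label in NO_CLOUD_LABELS:
                new_row.append(0)
            elif label in CLOUD_LABELS:
                new_row.append(1)
            else:
                raise ValueError(f"unexpected cloud-type label: {label}")
        mask.append(new_row)
    return mask


example = [[1, 4, 5], [15, 2, 10]]
print(to_binary_mask(example))
```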

Experimental setup and evaluation
In order to have a fair comparison with the results obtained in [32] and [26] for both datasets, we reproduce the same experimental setups. The data is arranged in such a way that the resulting input is a four-dimensional array I ∈ ℝ^{T×H×W×F}, where T is the number of lags or previous time-steps, which corresponds to the time dimension. H and W refer to the size of the image and make up the spatial dimensions. The last element F corresponds to the predicted features, which in both cases is 1. We use TensorFlow to implement our models and train and evaluate them on the given datasets. The hyperparameters of the models are tuned empirically, and the best values found are used.

Precipitation maps nowcasting
As for the precipitation maps dataset, we apply the preprocessing and split the dataset as described in section 4.1. We aim to predict a precipitation map 30 minutes ahead or, considering that the images are generated five minutes apart, six time-steps ahead. The number of lags, i.e. previous time-steps, is set to 12, which was empirically found to be the best among the tested lag values. The height and width of the images are 288 and 288, and the number of features in the input is 1, i.e. the precipitation maps. Therefore, the input of the model has the shape (12, 288, 288, 1), and the output has the shape (1, 288, 288, 1).
In this nowcasting task, we perform a regression of every pixel. Mean Squared Error (MSE) is used as the loss function and the Adam optimizer is used to minimize it, with an initial learning rate of 0.0001. The batch size and the dropout rate are set to 2 and 0.5, respectively. We also implemented a checkpoint callback to monitor the validation loss. Thus the best performing model on the validation set is saved. We use MSE as the main metric to assess the performance of the model on this dataset. Furthermore, we also include additional metrics such as accuracy, precision and recall. Following the lines of [32], in order to calculate these additional metrics, we first create a binarized mask of the image according to a threshold. This threshold is the mean value of the training set from the 50% of rain pixels dataset. Hence, any value equal to or above the threshold is replaced by 1, and any value under it is replaced by 0.
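The pairing of input lags with a target frame can be sketched as a sliding window. This is a minimal illustration in which integers stand in for images; the function name is hypothetical, and the sketch assumes that the frames form an unbroken sequence in time, which may not hold after the rain-based filtering.

```python
def make_samples(frames, lags, steps_ahead):
    """Pair each window of `lags` frames with the frame `steps_ahead` later.

    The target is counted from the last frame of the input window.
    """
    samples = []
    last_start = len(frames) - lags - steps_ahead
    for start in range(last_start + 1):
        inputs = frames[start:start + lags]
        target = frames[start + lags - 1 + steps_ahead]
        samples.append((inputs, target))
    return samples


# 20 consecutive frames, 12 lags, and a target 6 time-steps (30 minutes) ahead.
frames = list(range(20))
samples = make_samples(frames, lags=12, steps_ahead=6)
for inputs, target in samples:
    print(f"inputs {inputs[0]}..{inputs[-1]} -> target {target}")
```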
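The binarization and the additional metrics can be illustrated on a handful of pixel values. The sketch below works on flattened lists of pixels; the function names are hypothetical, the threshold of 0.5 in the example is a placeholder rather than the training-set mean used in the paper, and returning 0.0 when a denominator is zero is a convention of this example.

```python
def binarize(values, threshold):
    """Replace values equal to or above the threshold by 1, the rest by 0."""
    return [1 if value >= threshold else 0 for value in values]


def classification_metrics(y_true, y_pred, threshold):
    """Accuracy, precision and recall computed on binarized pixel values."""
    truth = binarize(y_true, threshold)
    pred = binarize(y_pred, threshold)
    pairs = list(zip(truth, pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


ground_truth = [0.0, 0.2, 0.6, 0.9]  # toy normalized pixel values
prediction = [0.1, 0.7, 0.4, 0.8]
print(classification_metrics(ground_truth, prediction, threshold=0.5))
```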

Cloud cover nowcasting
Regarding the cloud cover dataset, we preprocess the data and split the dataset as described in section 4.2. In this case, we predict six different time-steps: from 15 minutes to 90 minutes ahead, or from 1 to 6 time-steps ahead. Due to the architecture of our network, we train six different models, so that each model predicts a different time-step. Here, the number of lags is set to 4, the height and width of the images are 256 and 256, and the number of input features is again 1, the cloud cover map. That means that the model receives input data with the shape (4, 256, 256, 1), and outputs data with the shape (1, 256, 256, 1). In this task, we perform binary classification of every pixel. The binary cross-entropy is used as the loss function in this case. We use the Adam optimizer with an initial learning rate of 0.001. The batch size and the dropout rate are set to 8 and 0.5, respectively. Similarly, we implemented a checkpoint callback to monitor the validation loss. Thus the best performing model on the validation set is saved. Following the lines of [26], here we also use MSE as the metric to assess the performance of the model. First, we calculate the MSE between the ground truth and the raw prediction as the main metric. In this case, the values between 0 and 1 in the predictions indicate the probability of cloud occurrence in that region. In addition, we binarize the prediction of the network with a threshold of 0.5 to generate a second assessment with the MSE metric. We also include additional metrics, i.e. accuracy, precision and recall, to compare the performance of the Broad-UNet with the model introduced in [32], which also uses the UNet architecture as its basis. To calculate these additional metrics, we first create a binarized mask of the image, using the value 0.5 as the threshold.
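The two MSE assessments described above differ only in whether the predicted probabilities are thresholded first. The following sketch shows both on a few toy pixels; the function names are hypothetical, and treating a probability of exactly 0.5 as cloud is an assumption of this example.

```python
def mse(y_true, y_pred):
    """Mean squared error between two flattened lists of pixel values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def binarize(probabilities, threshold=0.5):
    """Turn cloud probabilities into 0/1 labels (>= threshold counts as cloud)."""
    return [1 if p >= threshold else 0 for p in probabilities]


ground_truth = [1, 0, 1, 0]               # toy binary cloud mask
probabilities = [0.75, 0.25, 0.25, 0.5]   # toy network output

raw_mse = mse(ground_truth, probabilities)
binarized_mse = mse(ground_truth, binarize(probabilities))
print(raw_mse, binarized_mse)
```

On binary data, the MSE of the binarized prediction equals the fraction of misclassified pixels, which is why it complements the MSE of the raw probabilities.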

Precipitation maps nowcasting
In the precipitation maps prediction task, we compare the performance of the Broad-UNet with the persistence, a simple meteorological baseline used in forecasting, and different models over the test sets of two different datasets, i.e. 50% of rain pixels and 20% of rain pixels. These models are the UNet [31] and two variants [32,27]. The MSE is the main metric used for this comparison, and it is calculated over the denormalized data. The additional metrics are computed over the binarized data, as described in section 5.1. The performance of different models over the first precipitation maps dataset is shown in Table 2. In the same way, the performance of the models in the second precipitation maps dataset is listed in Table 3. From the obtained results, one can observe that the Broad-UNet achieved the lowest MSE score in both datasets. Two examples of 30 minutes ahead prediction with the Broad-UNet are displayed in […]
Table 2: Test MSE and additional metrics values for the precipitation maps prediction task using the 50% of rain pixels dataset. ↓ indicates that the optimal values are the smallest ones and ↑ indicates that the optimal values are the highest ones.
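For reference, the persistence baseline mentioned in the comparison is commonly defined as predicting that the future frame equals the most recent observed frame. The sketch below follows that common definition, which is assumed here rather than taken from the text; the function names and the toy 2 × 2 frames are hypothetical.

```python
def persistence_forecast(input_frames):
    """Predict that the future frame equals the last observed frame."""
    return input_frames[-1]


def mse_2d(y_true, y_pred):
    """Mean squared error between two images given as lists of rows."""
    errors = [
        (t - p) ** 2
        for true_row, pred_row in zip(y_true, y_pred)
        for t, p in zip(true_row, pred_row)
    ]
    return sum(errors) / len(errors)


# Two toy input frames and the frame that was actually observed later.
inputs = [
    [[0, 0], [0, 0]],
    [[1, 2], [3, 4]],
]
target = [[1, 2], [3, 6]]

prediction = persistence_forecast(inputs)
print(mse_2d(target, prediction))
```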

Cloud cover nowcasting
When applying the Broad-UNet to the second dataset, we compare its performance with the persistence and various models. These models are introduced and explained in [26]. We perform this comparison with the results obtained from the test set of the cloud cover dataset. The evaluation metrics used are explained in section 5.2. In Fig. 7, we show the MSE obtained using the ground truth and the raw prediction. Fig. 8 depicts the MSE calculated with the ground truth and the binarized prediction. From Fig. 7 and Fig. 8, one can notice that the Broad-UNet performance is superior in short-term forecasting. As the number of steps ahead increases, the gap between the performance of the proposed Broad-UNet and the classical UNet model decreases. In addition, in Table 4, we show the comparison between the Broad-UNet and the model introduced in [32]. As in [32], the metrics tabulated in this […] predictions are displayed in Fig. 9. Both predictions are generated with the test set of the cloud cover dataset. In Fig. 9, the […]

Discussion
From the obtained results, one can observe that the multi-scale feature learning allows the Broad-UNet to make more precise predictions. This is thanks to the use of different convolutional filters in parallel. By combining convolutions with larger and smaller kernels, the model considers different amounts of information around the same region to generate the feature maps. Likewise, the inclusion of the ASPP module in the architecture allows the network to apply convolutions with diverse receptive fields at the same time.
In the precipitation nowcasting task, we can observe an 11% and an 8% improvement with respect to the simple UNet on the two datasets. In the cloud cover nowcasting task, the binarized predictions of the Broad-UNet are 7% more accurate than those of the simple UNet for 15 minutes ahead predictions, and 1% more accurate for 90 minutes ahead predictions. In the first nowcasting task, i.e. precipitation prediction, the model performs a regression of each pixel over a wide range of values, so achieving accurate forecasts, or equivalently lower MSE values, is more desirable. This is where the Broad-UNet shows the clearest advantage over the UNet. In the second nowcasting task, where the goal is to carry out a binary classification of each pixel, the Broad-UNet makes slightly more accurate predictions than the UNet. While the immediate predictions (i.e. 15 and 30 mins ahead) are more precise, more distant predictions (more than 45 mins ahead) are comparable to the UNet's predictions. Therefore, we can state that the wide building blocks of the Broad-UNet let the network extract the spatial and short-term temporal information more accurately than the regular UNet.
The feature maps learnt in the different branches inside a convolutional block are shown in Fig. 10. The chosen convolutional block is the first one, so that the data does not yet have too abstract a representation and is thus easier to interpret. The image fed to the network belongs to the precipitation maps dataset, and it is shown in Fig. 11. In Fig. 10, the first row of the feature maps is the output of the convolutional branch with kernel size 1 × 1 × 1. The second row corresponds to the output of the branch with kernel size 3 × 3 × 3. Lastly, the third row corresponds to the output of the branch with kernel size 5 × 5 × 5. From Fig. 10, one can observe the differences between the features extracted in each branch. The convolutions with kernel size 1 × 1 × 1 seem to strengthen detailed differences in the image, and convolutions with kernel size 3 × 3 × 3 seem to accentuate differences between areas containing a high and a low rain concentration. In addition, convolutions with kernel size 5 × 5 × 5 seem to highlight regions with high rain concentration.

Conclusion
In this paper the Broad-UNet, an extension of the UNet architecture, is introduced for precipitation as well as cloud cover nowcasting. Thanks to the combination of the multi-scale feature convolutional block and the incorporation of the ASPP module, the proposed network is able to capture multi-scale information. In addition, the use of factorized kernels drastically reduces the number of parameters in the network compared to the classical UNet model. The performance of the Broad-UNet is examined on two nowcasting problems. The first problem consists of predicting precipitation maps 30 mins ahead. The second one consists of forecasting cloud cover 15 to 90 mins ahead. The obtained results suggest that the Broad-UNet extracts features more efficiently and therefore makes more accurate predictions in short-term nowcasting tasks compared to other tested UNet based models.
Figure 11: Image fed to the network to generate the feature maps shown below. It belongs to the precipitation maps datasets, specifically to the 50% of rain pixel dataset.