SmaAt-UNet: Precipitation Nowcasting using a Small Attention-UNet Architecture

Weather forecasting is dominated by numerical weather prediction, which attempts to model the physical properties of the atmosphere accurately. A downside of numerical weather prediction is that it lacks the ability to make short-term forecasts using the latest available information. By using a data-driven neural network approach we show that it is possible to produce an accurate precipitation nowcast. To this end, we propose \textit{SmaAt-UNet}, an efficient convolutional neural network based on the well-known UNet architecture, equipped with attention modules and depthwise-separable convolutions. We evaluate our approach on a real-life dataset using precipitation maps from the region of the Netherlands. The experimental results show that, in terms of accuracy, the proposed model is comparable to other examined models while using only a quarter of the trainable parameters.


Introduction
Computational weather forecasting is a ubiquitous feature of modern, industrialized societies and is used for the planning, organization and management of a wide range of both personal and economic aspects of life. To date, the primary method for weather forecasting is numerical weather prediction (NWP). NWP relies on mathematical models that take into account different physical properties of the atmosphere such as air velocity, pressure and temperature. NWP-based models can generate accurate weather predictions several hours to days into the future. However, they involve solving highly complex mathematical models which are computationally expensive, require enormous computing power and are thus usually run on expensive supercomputers [1].
Due to their high computational and time requirements, NWP models are less suitable for short-term forecasts ranging from minutes to up to 6 hours, also referred to as nowcasting [2]. Nowcasting models are able to use the latest available observational weather data to create their predictions, making them more responsive compared to the NWP models [3]. This responsiveness is critical to increase the accuracy of predictions for dynamic and rapidly changing environments such as the atmosphere. Nowcasts have therefore become important tools to complement NWP approaches, especially in the context of meteorologically unstable conditions typical for severe weather hazards such as thunderstorms and heavy rainfall [3]. As highlighted by a status report to the American Meteorological Society, nowcasting thunderstorms finds pertinent applications across a variety of fields such as in aviation, the construction industry, power utilities and ground transportation [4]. Nowcasting was also used in the 2008 Olympic games in Beijing to ensure the safety of the athletes [5]. Not least, weather nowcasts can also be useful for planning ordinary activities of everyday life.
Recent advances in artificial neural network (ANN) architectures have enabled data-driven models to bridge the present gap in short-term forecasting [6,7,8,9]. The key difference between NWP and ANNs is that the former is a model-driven and the latter a data-driven approach. Unlike model-driven approaches, data-driven models do not base their predictions on calculations of the underlying physics of the atmosphere. Instead, they analyze and learn from historical weather data, such as past wind speed and precipitation maps, to predict the future.
In this paper, we introduce a novel artificial neural network based model to predict precipitation on a high-resolution grid 30 minutes into the future. The input data for our model consists of precipitation maps, i.e. cartographic radar images showing the accumulated rainfall over a period of time. In previous studies, convolutional neural networks have been described as an effective approach to process image data. Convolutions are kernel-based operations that slide over the image which enables the model to capture local invariant features in a more efficient manner than other feedforward approaches [10]. They have been successfully applied in various fields including not only the processing of images but also of other types of signals. For instance, the authors of [11] used a CNN-based model to create captions for an input image while [12] employed a CNN for object detection in images. The authors in [13] introduced a 3-dimensional CNN based model to predict the wind speed in different cities in Denmark. In another study, a CNN-based architecture is applied on signals from a smartphone's accelerometer to classify a user's transportation mode [14].
Given the usefulness of CNNs for tasks involving image input, they offer a promising solution for the purpose of precipitation nowcasting. In this paper, we propose the Small Attention-UNet (SmaAt-UNet) model. It uses the UNet architecture [15] as its core model and is equipped with attention modules and depthwise-separable convolutions (DSCs); see section 3 for more details.
The advantage of our model is that we are able to reduce the number of model parameters to a quarter of the original UNet implementation while maintaining performance comparable to the original UNet architecture. This reduction in model size opens up the possibility of using precipitation models on small computation units such as smartphones, similar to [16]. This could enable personalized and up-to-date precipitation forecasts by creating a forecast on user request with the latest available data within seconds. Furthermore, reducing model size while achieving performance similar to that of bigger models is crucial for creating efficient architectures that require less training and computational power.
This paper is organized as follows. A brief overview of related research on weather forecasting using machine learning architectures is presented in section 2. In section 3, we describe our proposed UNet based architecture for precipitation nowcasting as well as the models against which we compared the performance of our model. Section 4 describes the experiments conducted for this study and the obtained results. A discussion of the results is given in section 5. Lastly, we end with some conclusive remarks in section 6.

Related Work
A common approach to precipitation nowcasting based on deep learning uses neural networks that have some kind of memory, such as the long short-term memory (LSTM) network [17]. In standard feedforward models, the input is passed on in a straightforward fashion from one timestep to the next. In contrast, LSTMs are, broadly speaking, networks that enable the input signal to remain in the network's state for multiple time steps, allowing the network to remember past inputs. This is especially useful for time-series prediction because past inputs can contain valuable information about trends which, in turn, can be useful for predicting future values.
The authors in [6] created a convolutional LSTM that captures spatiotemporal correlations better than other approaches in a time-series task for images. Extending on this, the authors of [7] created a spatiotemporal-LSTM that increases the amount of memory connections inside the network which aims at enabling an efficient flow of spatial information. The memory function and memory flow of this model were optimized in another implementation that added stacked memory modules [18].
Another approach for precipitation nowcasting has been described in [19]. The authors proposed a network structure that is based on a well-known encoder-decoder architecture called UNet [15]. Unlike LSTMs, UNet has no explicit modeling of memory. It takes an input image (or multiple concatenated images) and outputs a single classification map. The implementation of [19] aimed at classifying four different rain intensities (< 0.1mm/h, < 1.0mm/h, < 2.5mm/h, > 2.5mm/h) one hour into the future. To this end, multiple precipitation maps (of the past hour) are concatenated and used as input to the UNet architecture. In a similar study [9], the authors classified 512 classes instead of just four, as opposed to the model described in [19], thereby resulting in a much finer resolution of rain intensities. This is similar to our approach; however, rather than predicting classes, our model predicts exact rain intensities.
A common baseline in precipitation nowcasting is the persistence method. The persistence model uses the last input image of a sequence as the prediction image. This is based on the assumption that the weather will not change significantly from time point t to t+1. Especially in nowcasting, this baseline is not trivial to outperform because the time differences between images are so short (e.g., 2 or 5 minutes) that weather conditions often remain the same [1].
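The persistence baseline can be sketched in a few lines; the array layout (time, height, width) is an illustrative assumption:

```python
import numpy as np

def persistence_forecast(input_sequence):
    """Persistence baseline: the forecast is simply the most recent
    observed precipitation map in the input sequence."""
    # input_sequence: array of shape (T, H, W), ordered oldest -> newest
    return input_sequence[-1]

# Toy example: three 2x2 "maps"; the prediction equals the last one.
seq = np.arange(12, dtype=float).reshape(3, 2, 2)
pred = persistence_forecast(seq)
```

Despite its simplicity, this is the reference every learned model must beat.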
Recently, it was shown that attention in CNNs can be a very useful tool to enhance performance for an underlying task [20,21,22,23,24]. Attention is a mechanism that amplifies wanted signals and suppresses unwanted ones. This directs the network to pay more attention to features important for the task at hand. In our proposed model, we employ convolutional block attention modules (CBAMs) that take the input image and apply attention in sequence to the channels and then to the spatial dimensions [25]. The result of a CBAM is a weighted feature map that takes into account the channels and also the spatial region of the input image. In another application of attention, authors of [23] added attention gates to a UNet architecture for a medical segmentation task. They found that their enhanced model achieved better results than the original UNet implementation by [15].
Having fewer parameters in a network reduces the chance of overfitting, because the model is simpler and cannot adapt too closely to the training set's distribution. A possible downside of this simplification is that the model may be too simple to learn the desired task. In order to reduce the number of parameters without sacrificing much performance, depthwise-separable convolutions (DSCs) are used in many recent architectures [26,27,28,16,29]. DSCs split the regular convolutional operation into two separate operations: a depthwise convolution followed by a pointwise convolution. This results in fewer mathematical operations and also fewer parameters compared to non-separated convolutions. The authors in [29] created a UNet with DSCs instead of regular convolutions; their model has eight times fewer parameters than the original UNet implementation. They show that their model achieves performance similar to UNet on medical segmentation tasks [29].
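The parameter saving can be illustrated with a minimal PyTorch sketch of a generic depthwise-separable layer (this is not the paper's exact implementation, which additionally uses a kernels-per-layer multiplier):

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """A depthwise convolution (one 3x3 kernel per input channel,
    via `groups=in_ch`) followed by a 1x1 pointwise convolution
    that mixes the channels."""
    def __init__(self, in_ch, out_ch, kernel_size=3, padding=1):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size,
                                   padding=padding, groups=in_ch)
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

# Parameter comparison for a 64 -> 128 channel, 3x3 convolution:
regular = nn.Conv2d(64, 128, 3, padding=1)
separable = DepthwiseSeparableConv(64, 128)
n_reg = sum(p.numel() for p in regular.parameters())  # 73,856
n_sep = sum(p.numel() for p in separable.parameters())  # 8,960
```

For this layer the separable variant needs roughly eight times fewer parameters, matching the reduction reported in [29].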

Proposed SmaAt-UNet
The model that we propose here builds upon and extends the UNet architecture [15]. As shown in Fig. 1, the UNet architecture consists of an encoder-decoder structure resulting in a U-shape. The encoder part (corresponding to the left half of Fig. 1) applies max-pooling (red arrows), which halves the image size, and a double convolution (blue arrows), which doubles the number of feature maps. The encoders are subsequently followed by the same number of decoders (corresponding to the right half of Fig. 1). Following the original implementation of UNet, we also use four encoder-decoder modules.
A decoder consists of three parts: a bilinear upsampling operation (green arrows) to double the feature map size, a concatenation of the resulting feature maps with the previous encoder's output via skip-connections (grey arrows), and lastly a double convolution to halve the number of feature maps. The skip-connections enable the model to use multiple scales of the input to generate the output. Finally, the last layer in our model is a 1 × 1 convolution (purple arrow) which outputs a single feature map representing the value predicted by the network.
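A single decoder stage as described above can be sketched as follows; the channel counts are illustrative assumptions, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class DoubleConv(nn.Module):
    """Two 3x3 convolutions, each followed by batch norm and ReLU."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))

    def forward(self, x):
        return self.block(x)

class Decoder(nn.Module):
    """Bilinear upsampling, concatenation with the skip connection,
    then a double convolution that reduces the channel count."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.up = nn.Upsample(scale_factor=2, mode="bilinear",
                              align_corners=True)
        self.conv = DoubleConv(in_ch, out_ch)

    def forward(self, x, skip):
        x = self.up(x)                    # double the spatial size
        x = torch.cat([skip, x], dim=1)   # append the encoder's output
        return self.conv(x)

# 128 channels from the level below plus a 64-channel skip -> 64 channels
dec = Decoder(128 + 64, 64)
out = dec(torch.randn(1, 128, 16, 16), torch.randn(1, 64, 32, 32))
```

The concatenation is what lets each decoder stage see the encoder features at its own scale.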
The advantage of using different scales is that they can capture differently sized objects in the input which can be important for some tasks. Typically, UNets are applied to classification or segmentation tasks in which the network is trained to predict a class for each pixel. However, we applied it to a time series prediction task in which the network has to predict an exact value for each pixel.
Our novel Small Attention-UNet (SmaAt-UNet) model makes two modifications in the original UNet architecture. Firstly, we propose to add an attention mechanism to the encoder part. Secondly, we transform the regular convolutional operations to depthwise-separable convolutions.
As described in section 2, using attention in a CNN facilitates the network to focus on specific parts of the input. For our model, we use convolutional block attention modules for the purpose of identifying important features across channels and spatial regions of the image [25]. In our dataset, an input image consists of 12 channels corresponding to 12 sequential time points. In CBAMs, the attention mechanism is applied first across the channels of the image and subsequently to the spatial dimension.
The CBAMs are placed after the first double convolution and after every encoder to amplify important features and suppress unimportant ones at the respective image scale (yellow arrows in Fig. 1). Importantly, the input to each encoder is the convolved and downsampled output of the previous encoder, not the feature maps with the attention mechanism applied. This way, the original image features are preserved until the last encoder. The attention modules only feed into the corresponding upsampling part of the network, to which they are connected through the skip-connections.
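A simplified sketch of a CBAM, following the channel-then-spatial ordering described above (the reduction ratio of 16 and the 7 × 7 spatial kernel are assumptions taken from common practice, not from the paper):

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Squeeze the spatial dims with avg- and max-pooling, pass both
    through a shared MLP, and sum to get per-channel weights."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels))

    def forward(self, x):
        b, c, _, _ = x.shape
        avg = self.mlp(x.mean(dim=(2, 3)))
        mx = self.mlp(x.amax(dim=(2, 3)))
        return torch.sigmoid(avg + mx).view(b, c, 1, 1)

class SpatialAttention(nn.Module):
    """Pool along the channel axis and convolve to get a
    per-pixel weight map."""
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size,
                              padding=kernel_size // 2)

    def forward(self, x):
        pooled = torch.cat([x.mean(dim=1, keepdim=True),
                            x.amax(dim=1, keepdim=True)], dim=1)
        return torch.sigmoid(self.conv(pooled))

class CBAM(nn.Module):
    """Channel attention first, then spatial attention."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.ca = ChannelAttention(channels, reduction)
        self.sa = SpatialAttention()

    def forward(self, x):
        x = x * self.ca(x)   # weight the channels
        return x * self.sa(x)  # weight the spatial positions

cbam = CBAM(64)
y = cbam(torch.randn(2, 64, 32, 32))
```

The output keeps the input's shape; only the feature magnitudes are reweighted.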
Following [26,16], we use depthwise-separable convolutions in our model in order to reduce the number of parameters. In particular, we substitute all convolutions of the original UNet model with depthwise-separable convolutions. However, in the convolutional block attention modules we still apply regular convolutions.

Other models
For comparison, we also trained other UNet architectures that have either none or only one of the two modifications that we proposed. This results in a total of four models being compared in this study, i.e. the original UNet, UNet with CBAM, UNet with DSCs, and our proposed model. Table 1 shows a comparison of the models' parameters. When comparing the standard UNet architecture with our proposed modified UNet architecture, it can be seen that the latter has significantly fewer parameters, i.e. ≈4M compared to ≈17M. In our PyTorch implementation we use DSCs with two kernels per layer.

Training
All four previously described models were trained for a maximum of 200 epochs. We employed an early stopping criterion which stopped the training process when the validation loss did not decrease over the last 15 epochs. The early stopping criterion was met in all training iterations, so the maximum of 200 epochs was never reached. Additionally, we used a learning rate scheduler that reduced the learning rate to a tenth of the previous learning rate when the validation loss did not decrease for four epochs. The initial learning rate was set to 0.001 and we used the Adam optimizer [30] with default values. The training was done on a single NVIDIA RTX 2070 Super with 8 GB of VRAM.
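The early stopping and learning rate schedule can be sketched with PyTorch's ReduceLROnPlateau; the model and validation loss here are placeholders, not the paper's actual training loop:

```python
import torch

model = torch.nn.Linear(4, 1)  # stand-in for one of the UNet variants
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
# Reduce the LR to a tenth after 4 epochs without improvement.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.1, patience=4)

best_loss, epochs_without_improvement = float("inf"), 0
for epoch in range(200):
    val_loss = 1.0  # placeholder: compute the validation loss here
    scheduler.step(val_loss)
    if val_loss < best_loss:
        best_loss, epochs_without_improvement = val_loss, 0
        # torch.save(model.state_dict(), "best.pt")  # keep best checkpoint
    else:
        epochs_without_improvement += 1
    if epochs_without_improvement >= 15:  # early stopping criterion
        break
```

Saving the checkpoint with the lowest validation loss matches the model selection described in the results section.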

Model evaluation
The loss function used in this study is the mean squared error (MSE) between the output images and the ground truth images. The MSE is calculated as follows:

\[ \mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2, \]

where $n$ is the number of samples, $y_i$ is the value of the ground truth and $\hat{y}_i$ is the value of the prediction. In addition to the MSE, we calculate different scores for performance evaluation: Precision, Recall, Accuracy, F1-score, critical success index (CSI) and false alarm rate (FAR). These scores are calculated for rainfall above a threshold of 0.5mm/h. To do this, we convert each pixel of the predicted output and target images to a boolean mask using this threshold. From this, one can calculate the true positives (TP) (prediction = 1, target = 1), false positives (FP) (prediction = 1, target = 0), true negatives (TN) (prediction = 0, target = 0) and false negatives (FN) (prediction = 0, target = 1). Subsequently, the CSI and FAR metrics can be computed as follows:

\[ \mathrm{CSI} = \frac{TP}{TP + FN + FP}, \qquad \mathrm{FAR} = \frac{FP}{TP + FP}. \]

The threshold of 0.5mm/h was chosen in line with the works of [6,31]; it differentiates between rain and no rain.
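The score computation from the boolean masks can be sketched as follows; the toy maps are illustrative:

```python
import numpy as np

def nowcast_scores(pred, target, threshold=0.5):
    """Binarize predicted and target rain maps at `threshold` (mm/h)
    and compute the confusion-matrix based scores."""
    p, t = pred > threshold, target > threshold
    tp = np.sum(p & t)    # prediction = 1, target = 1
    fp = np.sum(p & ~t)   # prediction = 1, target = 0
    tn = np.sum(~p & ~t)  # prediction = 0, target = 0
    fn = np.sum(~p & t)   # prediction = 0, target = 1
    return {
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "f1": 2 * tp / (2 * tp + fp + fn),
        "csi": tp / (tp + fp + fn),
        "far": fp / (tp + fp),
    }

# Toy 2x2 maps containing exactly one TP, FP, TN and FN.
pred_map = np.array([[1.2, 0.0], [0.8, 0.3]])
target_map = np.array([[0.9, 0.0], [0.2, 0.7]])
scores = nowcast_scores(pred_map, target_map)
```

In practice these counts would be accumulated over the whole test set before the ratios are computed.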

Experiments
We used a precipitation dataset from the Royal Netherlands Meteorological Institute (Koninklijk Nederlands Meteorologisch Instituut, KNMI) to train and compare our models. It contains rain maps in 5-minute intervals spanning four years (2016-2019) and covering the region of the Netherlands and the neighboring countries. In total, the dataset comprises about 420,000 rain maps. The data is generated by radar measurements from two Dutch radar stations (De Bilt and Den Helder). We split the dataset into a training set (years 2016-2018) and a testing set (year 2019). Additionally, for every training iteration, a validation set was created by randomly selecting 10% of the training set.
The raw rain maps have a dimension of 765 × 700 and one pixel corresponds to the accumulated rainfall in the last five minutes on one square kilometer. The amount of rainfall is noted as an integer value in the unit of a hundredth of millimeter. For instance, a value of 12 means there was 0.12mm of rainfall in the last five minutes.
As a data preparation step, we divided the values of both the training and testing set by the highest occurring value in the training set to normalize the data. Furthermore, we cropped the image and only used a subset of the original precipitation map (Fig. 2). This was done due to the fact that many pixels of the raw image have no-data values because the raw image is larger than the maximum range of the radar (see the white margin in the left panel of Fig. 2). The area within the range of the radar has a circular shape and a diameter of 421 pixels, corresponding to 421 kilometers. When cropping the image in a way that preserves the entire radar image, there are still many pixels with no-data values (white corners in the middle panel of Fig. 2). Since training a neural network with no-data values is more difficult, we applied an additional center crop of 288 pixels (right panel of Fig. 2).
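The normalization and cropping steps can be sketched as follows; the paper does not specify the exact crop offsets, so a plain center crop is assumed:

```python
import numpy as np

def preprocess(raw_map, train_max, crop_size=288):
    """Center-crop the raw 765x700 radar map to crop_size x crop_size
    and normalize by the maximum value seen in the training set."""
    h, w = raw_map.shape
    top = (h - crop_size) // 2
    left = (w - crop_size) // 2
    cropped = raw_map[top:top + crop_size, left:left + crop_size]
    return cropped / train_max

# Toy raw map with values in hundredths of a millimeter (0..499).
raw = np.random.randint(0, 500, size=(765, 700)).astype(float)
out = preprocess(raw, train_max=500.0)
```

Dividing by the training-set maximum keeps the test set on the same scale as the training data, as described above.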
The input for the models is a sequence of 12 precipitation maps stacked along the channel dimension. This corresponds to one hour of past weather observations (12 × 5 min). The output is the precipitation map 30 minutes after the last input image. The task for the network is therefore to predict exact rainfall intensities for every pixel of the 288 × 288 rain map 30 minutes into the future.
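Building one input-target pair from a sequence of 5-minute maps can be sketched as:

```python
import numpy as np

def make_sample(maps, start, n_input=12, lead_steps=6):
    """Stack `n_input` consecutive 5-minute maps along the channel
    dimension and take the map `lead_steps` intervals (here 30 min)
    after the last input frame as the target."""
    x = np.stack(maps[start:start + n_input], axis=0)  # (12, H, W)
    y = maps[start + n_input - 1 + lead_steps]         # (H, W)
    return x, y

# Toy sequence: each map is constant and equal to its time index.
maps = [np.full((288, 288), float(t)) for t in range(30)]
x, y = make_sample(maps, start=0)
# Last input frame is t=11, so the target is the map at t=17.
```

The 12 stacked maps play the role of input channels, matching the 12-channel input described in section 3.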
The dataset contains many rain maps with very little to no rain. Therefore, in order to avoid biasing the network towards predicting zero values, we created two additional datasets whose target images have a minimum amount of rainy pixels. One of the two datasets contains samples with at least 20% rainy pixels in the target images and the other with at least 50%. The number of samples in these two datasets is necessarily much smaller than in the original dataset, but they also correspond more closely to the intended use-case of the model, i.e. predicting rain. A comparison of the sample sizes of the three datasets can be found in Table 2.
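The filtering by rainy-pixel fraction can be sketched as follows (counting a pixel as rainy whenever its value is above zero is an assumption for illustration):

```python
import numpy as np

def rainy_fraction(target_map, threshold=0.0):
    """Fraction of pixels in the target map with rainfall above threshold."""
    return float(np.mean(target_map > threshold))

def filter_samples(samples, min_fraction):
    """Keep only (input, target) pairs whose target has at least
    `min_fraction` rainy pixels."""
    return [(x, y) for x, y in samples
            if rainy_fraction(y) >= min_fraction]

# Toy example: one target with 50% rainy pixels, one fully dry.
wet = np.array([[1.0, 0.0], [2.0, 0.0]])
dry = np.zeros((2, 2))
kept = filter_samples([(None, wet), (None, dry)], min_fraction=0.5)
```

Running this with `min_fraction=0.2` and `min_fraction=0.5` over the full dataset would produce the two additional subsets described above.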
We trained the models on the dataset in which the target image has at least 50% rainy pixels. This sets the focus of the trained networks on instances of rain. A similar approach was taken by the authors of [6], who selected the top 97 rainy days of their three-year dataset for training.
Furthermore, this enables the use of the dataset with at least 20% rainy pixels as an additional performance indicator, more precisely as an indicator of the generalizability of the models: the trained models have not seen a single precipitation map from this test dataset. Moreover, the models may have been biased towards predicting more rain due to the predominantly rainy precipitation maps they were trained on. It is therefore possible that the models perform worse on this test set than on the one that closely resembles their training data.

Results and Discussion
Following the training of the four models, we selected, for each model, the checkpoint with the lowest validation loss from its training run. These best-performing models were then used to calculate the MSE on the test set. The obtained results are tabulated in Table 3. Note that the MSE is calculated after denormalizing the model predictions to the original rain intensities (mm/5min).
The obtained results show that the common persistence baseline is outperformed by every model we tested by a large margin. This is noteworthy because, as mentioned before, it can be difficult to outperform this baseline in nowcasting due to the small time changes between the input and target.
We found that adding the proposed two modifications, i.e. DSCs and CBAMs, to the UNet architecture altered the models' performance in comparison to the original UNet implementation. On the one hand, implementing each modification alone slightly decreased the performance. On the other hand, our proposed model, SmaAt-UNet, which incorporates both modifications into plain UNet, resulted in a better performance than UNet combined with each of the modifications alone. It should be noted that equipping UNet with only CBAMs resulted in the highest MSE, 1422.08. Concerning our second modification, i.e. substituting the regular convolutions with DSCs, the results are more mixed. On the one hand, the performance of the UNet with DSCs is worse than that of the original UNet model (1060.52 and 1013.12, respectively). However, it still performs better than the UNet model with CBAMs. On the other hand, it is important to note that substituting regular convolutions with DSCs reduced the network's model size to a quarter of the original UNet.

Figure 3 shows an example of the models' output for a precipitation nowcast. In contrast to the ground truth image (top left panel), the predicted precipitation maps of all models are quite blurry. One reason for this is the use of MSE as the guiding loss, which is biased towards blurry images [32]. The bias towards blurriness is due to the fact that, given the many possibilities for future frames based on the input sequence, the model tries to keep the error low by predicting a value that is closest to all possible outcomes [33]. Or, as Babaeizadeh et al. put it, "the models trained with a mean squared error loss function generate the expected value of all the possibilities for each pixel independently, which is potentially inherently blurry" [34].
Furthermore, one can see in Figure 3 that SmaAt-UNet captures the development of intense rain clusters (lower left corner) better than the other models. UNet with DSCs predicts a horizontal elongation that spreads too far. UNet with CBAM does this better, but predicts values that are too conservative. UNet produces an output similar to SmaAt-UNet, but does not predict the vertical spread of the precipitation of the left rain cluster well enough.
Furthermore, we calculated several performance metrics; the obtained scores are also tabulated in Table 3. This table shows that while the original UNet implementation performs best in most scores, our SmaAt-UNet performs second best in five out of the six scores. Thus, SmaAt-UNet is able to approximate UNet's performance even though it has only a quarter of its parameters.
In order to test the generalizability of the models, we use the other subset of our dataset that was described in section 4, i.e. the one that requires only 20% of the target image pixels to contain rain. The MSE and scores for this test set are given in Table 4. As can be seen in this table, the results are similar to the ones in Table 3. Specifically, when ranking the models we can see that the original UNet implementation performs best in almost all metrics and our SmaAt-UNet comes in as a close second, again in almost all metrics. This means that although the models have not seen many inputs with little rain, UNet and SmaAt-UNet are able to extrapolate best from the limited data available to them at training time. An explanation for the lower MSE on this dataset is that more values of the precipitation maps are close to zero (due to little rain) and therefore do not increase the overall MSE by a large margin if the model also predicts small values.

Figure 4 depicts example feature maps from the attention part of the encoder modules. The figure illustrates that the network's attention maps learn to focus on particular parts of the input sequence, demonstrating the learning effect of the attention mechanism. The rows depict the different stages of the encoders, which can be seen by the decrease in resolution in each row. Furthermore, it can be seen that the attention feature maps focus on different characteristics of the input. For example, in the first row, some feature maps focus on a rain cluster in the lower left corner (maps 2 and 8) while others focus on the parts with little to no rain (maps 4, 5 and 7). The bottom row shows feature maps from the last encoder stage of the SmaAt-UNet, which have a resolution of 18 × 18. As the images in the bottom row illustrate, this low resolution leads the network to identify coarse patterns such as the rain cluster at the bottom of the maps (maps 2, 3, 5 and 7).

Conclusion and Future Work
In this paper we proposed SmaAt-UNet, a smaller and attentive version of the UNet architecture. It has been shown that it performs on par with similar architectures that are considerably larger on a precipitation nowcasting task. Furthermore, the experiments showed that using only one of our two proposed changes is insufficient to reach good performance. The development of small and efficient neural networks, such as SmaAt-UNet, enables their application on smartphones. For instance, an application with multiple trained SmaAt-UNets with different forecasting times would allow precipitation forecasting with the latest available data at the user's request. Furthermore, creating energy-efficient architectures, such as SmaAt-UNet, reduces the carbon footprint. Being mindful of the resources required for training a neural network is a crucial step towards sustainable machine learning practices. Even though we trained our models on a rather small dataset and without augmentation techniques, we were able to achieve good results on the held-out test set. Increasing the dataset size should yield better performance for the tested models. As a next step, we would like to test the capabilities of SmaAt-UNet on different datasets and on different tasks.