Improving remote sensing of extreme events with machine learning: land surface temperature retrievals from IASI observations

Retrieving weather extremes from observations is critical for weather forecasting and climate impact studies. Statistical and machine learning methods are increasingly popular in the remote sensing community. However, these models behave as regression tools and, as such, are not always well suited to the estimation of extreme weather states. This study first introduces two error types that arise from such statistical methods: (a) ‘dampening’ refers to the reduction of the range of variability in the retrieved values, a natural behavior for regression models; (b) ‘inflating’ is the opposite effect (i.e. larger ranges) due to data pooling. We then introduce the concept of localization, which aims to better take local conditions into account in the statistical model. Localization largely improves the retrieval of extreme states, and can be used both for retrievals at the pixel level and in image-processing techniques. This approach is tested on the retrieval of land surface temperature using infrared atmospheric sounding interferometer observations: the dampening is reduced from 1.9 K to 1.6 K, and the inflating from 1.1 K to 0.5 K.


Introduction
Extreme events are defined as unusual and severe weather conditions that generally lie in the tail of the observed historical distributions. A better monitoring of extreme weather states is essential for many socioeconomic sectors. For example, droughts have a large impact on food production and will require changes in water management strategies in the agricultural sector [1][2][3]. Prolonged heatwaves have a direct impact on health and mortality [4][5][6][7], with mortality from heat extremes between 1980 and 2009 estimated at double what it would have been without climate change [8]. These ongoing changes have an impact on the public and private sectors. It is therefore urgent to better monitor such events to limit their socioeconomic impact.
Currently, numerical weather prediction (NWP) models are the best tool to predict weather extremes. NWP relies on our ability to assimilate observations, and in particular observations of weather extremes. Satellite remote sensing is the ideal tool to monitor the atmosphere and the land/ocean surfaces at a global scale and with a high revisit time (e.g. geostationary satellites such as Meteosat Second Generation have a revisit time of 15 min). To do so, the satellite community has increasingly relied on statistical methods, including neural networks (NNs), to process Earth observations [9][10][11][12][13]. However, the very nature of such statistical regression methods is to make compromises, as they are trained on a large database of 'pooled' samples. Data pooling here refers to the process of combining data samples from either several sources or, in the case of remote sensing, different locations around the globe. Thus, they tend to perform better under average conditions and less well under more extreme ones. A regression model provides a trivial example, because it always explains less than 100% of the true variance of a dataset [14], meaning that extremes cannot be perfectly estimated. This is what we call the 'dampening' of the retrieval ranges of variability. In parallel, to simplify the implementation of the models, data from different locations are often pooled together to train a unique, global model, meaning that the same NN is used to retrieve variables over somewhat different environments and is required to make compromises. This leads, although less frequently, to what we refer to as the 'inflating' of the retrieval ranges of variability. It is crucial to understand and control these two types of errors (i.e. dampening and inflating) in order to make the best possible use of emerging machine learning (ML) methods, in particular (but not only) for a better quantification of weather extremes.
Localization is a novel principle used to help the NN model adjust its behavior to local conditions [15,16]. Several forms of localization, such as additional inputs or pixel-wise models, can be used to improve NN retrievals. More recent image-processing techniques such as convolutional neural networks (CNNs) allow for the use of spatial dependencies that both help the retrieval and reduce the data pooling at the origin of the inflating: this is another form of localization. Such localization strategies should help limit the dampening/inflating issues of statistical models. This study aims to document and improve ML retrievals by using localization to reduce the dampening and inflating errors when retrieving extreme values.
In this study, we focus on the Earth's land surface temperature (LST) which is listed as an essential climate variable under the Global Climate Observing System. Specifically, we rely on infrared atmospheric sounding interferometer (IASI) to estimate LST [17][18][19][20], a product that is operationally assimilated in NWP models [21][22][23]. This LST application is used as an example to show the improvements on the retrieval of extreme states, but the method presented here is general and can be transposed to any remote sensing application. Section 2 presents the data used in this study and provides a detailed explanation of the dampening and inflating effects, and the methods proposed here to limit them. Results and method comparisons are presented in section 3. Conclusions and perspectives are given in section 4.

Datasets
For the application part of this study, we use a database of collocated brightness temperature (TB) observations from IASI (Metop A) and European Centre for Medium-Range Weather Forecasts (ECMWF) ERA5 reanalysis of LST [24,25]. A surface emissivities climatology is used as an ancillary database. Both the TB observations and the emissivities are regridded onto a 0.25° regular grid. Regridding is done by averaging the TB (or emissivity) values of IASI pixels whose centers fall in each gridpoint. The ERA5 LST is available on the same regular grid. We use data ranging from July 2007 to December 2018 over a fixed domain covering France and its surroundings only, offering a rich variety of surfaces (including coasts, lowlands and mountains). The domain ranges from 41°N to 52°N in latitude and 5°W to 10°E in longitude. Figure 1 outlines the domain by showing the elevation (in meters) of each grid point. Such a limited domain was chosen to facilitate the use of novel image-processing deep learning (DL) techniques. All processed data is available in [26].
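The regridding step described above can be sketched in a few lines of numpy. This is an illustrative sketch, not the operational code: the function name is hypothetical, and the grid bounds and 0.25° step are taken from the text.

```python
import numpy as np

def regrid_average(lat, lon, values,
                   lat0=41.0, lat1=52.0, lon0=-5.0, lon1=10.0, step=0.25):
    """Average irregular pixel values onto a regular lat/lon grid.

    Pixels whose centers fall inside a grid cell are averaged together;
    cells containing no pixel are left as NaN.
    """
    nlat = int(round((lat1 - lat0) / step))
    nlon = int(round((lon1 - lon0) / step))
    grid_sum = np.zeros((nlat, nlon))
    grid_cnt = np.zeros((nlat, nlon))
    # index of the grid cell containing each pixel center
    ilat = np.floor((lat - lat0) / step).astype(int)
    ilon = np.floor((lon - lon0) / step).astype(int)
    ok = (ilat >= 0) & (ilat < nlat) & (ilon >= 0) & (ilon < nlon)
    np.add.at(grid_sum, (ilat[ok], ilon[ok]), values[ok])
    np.add.at(grid_cnt, (ilat[ok], ilon[ok]), 1)
    return np.where(grid_cnt > 0, grid_sum / np.maximum(grid_cnt, 1), np.nan)
```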

Satellite observations
The IASI instrument is a Michelson interferometer measuring infrared radiation placed onboard the European meteorological satellites MetOp. The MetOp series is operated by the European Organisation for the Exploitation of Meteorological Satellites (EUMETSAT). IASI offers a high-resolution infrared spectrum between 675 and 2760 cm −1 over 8461 channels. This spectrum includes several window channels for the retrieval of surface properties, from which we regularly extract channels between 770-980 cm −1 , 1080-1150 cm −1 and 1210-1650 cm −1 (for a total of 563 channels). This extraction is a good compromise that includes most of the information contained in these window channels while keeping some redundancy to help reduce the instrument noise. We then apply a principal component analysis (PCA), an effective method to reduce dimensionality [17,[27][28][29], prior to using the ML/DL algorithms. The information is compressed using a PCA into three principal components (PCs) representing 99.30%, 0.54% and 0.08% of the total variance (for a total of 99.92%). These PCs contain most of the information about surface properties from the IASI spectrum. Such a high percentage for the primary PC can be explained by the high spectral correlation between the extracted channels. We extract an image each time a MetOp A polar orbit partially covers the chosen domain (approximately twice a day), which results in 15 711 'images' over the period. Only 8242 images of the domain over the period are kept after screening out the images covering only a small part of the domain or those with a large cloud fraction. The acceptance criterion is that there should be at least 30 clear pixels in the image, and a Euclidean distance of at least 20 pixels between a pair of clear pixels. Naturally, the extracted images only contain clear-sky pixels, since no information on the surface can be obtained for cloudy pixels.
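The PCA compression step can be sketched as follows. This is a minimal numpy illustration assuming the TB spectra are arranged as a samples × channels matrix; the function name is hypothetical.

```python
import numpy as np

def pca_compress(tb, n_components=3):
    """Compress TB spectra (n_samples x n_channels) into leading PCs.

    Returns the projections onto the leading components and the fraction
    of total variance each component explains (the paper keeps 3 PCs,
    ~99.92% of the variance of the 563 extracted channels).
    """
    anomalies = tb - tb.mean(axis=0)
    # SVD of the centered data matrix gives the PCA decomposition directly
    u, s, vt = np.linalg.svd(anomalies, full_matrices=False)
    explained = s**2 / np.sum(s**2)
    pcs = anomalies @ vt[:n_components].T   # projection onto leading axes
    return pcs, explained[:n_components]
```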
An analysis of the extracted database was performed to ensure that both summer and winter months were fairly represented.

ERA5 LST database
The ECMWF ERA5 reanalysis [25] is a re-processing of an NWP model constrained with most of the available observations. It is used to estimate the state of the surface and atmosphere with an hourly timestep. The LST product over the domain is extracted from this reanalysis, and the observed clear-sky IASI TBs are collocated with the closest hour of the LST product.

TELSEI Emissivity climatology
A climatology of surface emissivities derived from TELSEI (a Tool to Estimate Land Surface Emissivity in the Infrared) [18] is used. This tool gives monthly global estimates of infrared emissivity at the IASI spectral resolution. To remain consistent with the extracted IASI TB channels, three channels are kept for the emissivity database. As in [15], we choose the 850, 900 and 1100 cm −1 wave-numbers as representative of the overall spectrum.

LST retrieval methodology
Statistical models have been used in the remote sensing community for over three decades. Such methods, and more specifically NNs, have proven to be particularly efficient at retrieving geophysical variables from satellite observations, see IASI applications in [17-20, 27, 30-34]. However, a well-known problem when using statistical models is their tendency to dampen the range of values of the retrieved variables. This issue is not specific to NNs but is common to most statistical retrieval schemes, including random forests, linear and non-linear regressions, optimal interpolation, etc. It should be pointed out that non-linear regression models are equally affected. This section aims to explain the reason behind this problem and provides a tool to improve the retrieval of extreme values.

Multilayer perceptrons (MLPs) and CNNs
MLPs are a simple and very common type of NN, in which neurons of one layer have direct connections to neurons in the next layer. If enough samples are available, and the architecture contains enough neurons, any continuous relationship can be learnt by an MLP model [35][36][37]. An MLP tries to approximate a function f(x) = y with an estimate f̂(x) = ŷ, where x are the inputs to the MLP and ŷ the MLP's approximation of the targets y. The quality of ŷ strongly depends on the information available in the inputs x. In the remote sensing community, the inputs to MLP models are generally taken from single pixels only. This is both because there is a common consensus that there is little or no information among the neighboring pixels that can help the retrieval, and because the nature of the data (i.e. shifting orbits) does not facilitate the use of image-processing approaches.
We nonetheless propose to test CNNs, one of the most popular image-processing models, to exploit the information in neighboring pixels and spatial patterns to improve the quality of retrievals. CNNs are deep NN models where each convolutional layer is a set of filters passed over the height and width of the input volume (typically, lat × lon × PCs). Once training of the NN is complete, filters activate when they detect a certain spatial pattern. These approaches require images as input to the model. For our case study, it is therefore necessary to transform the original IASI data (i.e. orbits) into images. Here, this is done by focusing on a specific domain and capturing an image every time an orbit passes over the chosen area. Other solutions for IASI data are considered in [16]. As previously mentioned, only clear-sky pixels can be used to retrieve LST. However, CNNs require complete images, both for the training and operational prediction phases. Small gaps are first filled by inter- and extrapolating the images; the remaining missing pixels are considered larger holes. These holes are filled using an algorithm relying on the empirical orthogonal function (EOF) representation of images [27,28,38]. In this approach, an image X can be rewritten as X = λ_1 · F_1 + λ_2 · F_2 + . . . + λ_m · F_m. After calculating the EOFs (F_1, . . ., F_m) using the covariance matrix of our database (computed using only clear pixels), the missing data are filled with a first guess: in this case we use λ_1 · F_1, with λ_1 chosen to minimize the root mean square error over the available pixels. An optimization procedure then alternates between projecting the now-complete image onto the EOF basis to obtain λ_1 . . . λ_m and replacing the missing data with λ_1 · F_1 + λ_2 · F_2 + . . . + λ_m · F_m, until the λ coefficients converge. This protocol is described in more depth in [15].
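The iterative EOF gap-filling can be sketched as follows. This is a simplified numpy illustration of the idea, assuming images are flattened to vectors; the exact first-guess and convergence criteria of [15] are omitted, and the function name is hypothetical.

```python
import numpy as np

def eof_fill(images, mask, n_modes=3, n_iter=50):
    """Fill missing pixels by iterative projection onto EOFs.

    images: (n_images, n_pixels), arbitrary values where mask is False
    mask:   (n_images, n_pixels), True where the pixel is observed

    Missing values are initialised with the pixel-wise mean over the
    database, then the images are repeatedly reconstructed from their
    leading EOF modes, keeping observed pixels fixed, until convergence.
    """
    filled = np.where(mask, images, np.nan)
    col_mean = np.nanmean(filled, axis=0)
    filled = np.where(mask, images, col_mean)            # first guess
    for _ in range(n_iter):
        mean = filled.mean(axis=0)
        anom = filled - mean
        u, s, vt = np.linalg.svd(anom, full_matrices=False)
        recon = (u[:, :n_modes] * s[:n_modes]) @ vt[:n_modes] + mean
        filled = np.where(mask, images, recon)           # keep observed pixels
    return filled
```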

The 'dampening' effect of statistical methods vs. the 'inflating' effect of data pooling
We introduce the dampening effect, defined by the range of values in the retrieved LST dataset being smaller than the range of values in the original (ERA5) LST dataset. Similarly, we introduce the inflating effect, in which the retrieved range is larger than the original one. Mathematically, let us define the sets of samples of extreme low and high values of LST at the pixel level (pixel i) from the 10th (Q_0.1^ERA5(i)) and 90th (Q_0.9^ERA5(i)) percentiles:

S_L(i) = {t : LST(i, t) <= Q_0.1^ERA5(i)} and S_H(i) = {t : LST(i, t) >= Q_0.9^ERA5(i)},

where t spans the time dimension. These sets are defined once from ERA5 and used for both the original (ERA5) and the retrieved dataset. We then define an estimate of the range over a pixel i, for a particular dataset, as the average LST value of the samples in S_H(i) minus the average LST value of the samples in S_L(i):

R_i = mean_{t in S_H(i)} LST(i, t) − mean_{t in S_L(i)} LST(i, t).

Dampening is the process by which the retrieved range, computed from the ŷ values, is smaller than the real range, computed from the y^ERA5 values: R_i^RET < R_i^ERA5. Because the retrieval errors are larger for these extreme values, it is often argued that the problem is due to a lack of extreme conditions in the training dataset. However, enriching the training dataset of the NN with additional extreme conditions does not solve the problem, because sampling is not the real issue. Figure 2(a) aims to illustrate what happens. We consider an NN with only one input (TB PC1) and one output (LST). Let us consider a cloud of points (in red) representing the actual true input-to-output relationship (i.e. the TB-to-LST relationship). As previously described, an NN is an algorithm that tries to approximate a complex function f(TB) = LST. The NN can be seen as a single-valued function, with only one output for each input. Since the relationship is monovariate, the NN approximates the cloud of points in every small horizontal bin along the x-axis with the average value of the target LST (the black squares in the figure). It can be seen that the NN approximation curve runs through these points.
The extreme values (both smallest and largest) in the shaded areas will thus not be present in the retrieval range. Adding samples to represent these extreme values will not reduce the vertical dispersion. In fact, no matter the statistical method used (linear regression, random forests, NN, etc), this vertical dispersion cannot be reduced. The only possible solution would be to introduce more PCs into the model with the expectation that the additional information will help explain and predict the existing vertical dispersion in all horizontal bins.
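The situation illustrated in figure 2(a) can be reproduced with a toy monovariate example: even the ideal regression, the conditional mean E[LST | TB], has a narrower spread than the true LST because the vertical dispersion is unexplained. The data below is synthetic and for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
tb = rng.uniform(-1.0, 1.0, 10000)        # single input (TB PC1)
lst = tb + rng.normal(0.0, 0.5, 10000)    # "true" LST = signal + unexplained dispersion

# The best any regression can do from tb alone is the conditional mean
# E[LST | tb]; in this toy setup that is tb itself.  Its spread (and hence
# its range) is necessarily smaller than that of the true LST: dampening.
pred = tb
print(f"true std: {lst.std():.2f}, retrieved std: {pred.std():.2f}")
```

Adding more samples to this cloud of points does not change the outcome; only extra informative inputs could explain part of the residual dispersion.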
When working on a spatial domain, as is the case in this study, a form of data pooling is commonly done. Data pooling refers to combining multiple pixels of the domain, over which the TB-to-LST relationships differ, to train a single model. This data pooling creates an overall dampening of the global range. However, at a local scale, the LST ranges can in fact be wider than they should be. This happens because the obtained NN makes a compromise among all the pixels. The expectation when using a global NN is that the retrieval performs well at each location (i.e. over the Alps, over coastal areas . . .). This is an ambitious task, all the more so when working at the global scale. It would be possible if we had all the necessary information to correctly represent the radiative transfer (RT) (forward or inverse) all around the world, whatever the environment type or local conditions, but this is difficult for the Earth's surfaces, where the RT is still not entirely satisfactory [39]. Some surface parameters describing the soil properties or the state of the vegetation are not available at the global scale, so only a simplified relationship is attainable. A global NN in fact finds the best compromise, using a simplified relationship between the satellite observations and the variable(s) to retrieve. As a result, we may obtain an inflating effect: R_i^RET > R_i^ERA5. This inflation runs contrary to the nature of regression models, which is why there are fewer pixels with inflating than with dampening. Figure 2(b) shows the dampening and inflating effects by plotting the probability density function (PDF) of the difference in ranges (∆_R = R_RET − R_ERA5) over all pixels. There is a net dampening when ∆_R < 0, and a net inflating when ∆_R > 0. Both the dampening and inflating effects have negative consequences on the retrieval (i.e. they contribute to the error of the model), especially for extreme events.
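The per-pixel range difference ∆_R used in figure 2(b) can be computed as follows. This is a sketch following the definitions above: lst_era5 and lst_ret are the time series of one pixel, and the percentile thresholds defining S_L and S_H are always taken from the ERA5 series.

```python
import numpy as np

def delta_range(lst_era5, lst_ret):
    """Delta_R = R_RET - R_ERA5 for one pixel.

    Negative values indicate dampening, positive values inflating.
    The sample sets S_L and S_H are defined from the 10th and 90th
    percentiles of the ERA5 series, as in the text.
    """
    q10, q90 = np.percentile(lst_era5, [10, 90])
    s_low = lst_era5 <= q10      # time indices of extreme lows  (S_L)
    s_high = lst_era5 >= q90     # time indices of extreme highs (S_H)
    r_era5 = lst_era5[s_high].mean() - lst_era5[s_low].mean()
    r_ret = lst_ret[s_high].mean() - lst_ret[s_low].mean()
    return r_ret - r_era5
```

For instance, a retrieval that halves every deviation from the pixel mean yields ∆_R equal to minus half the ERA5 range, i.e. a pure dampening signal.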
Whether one wishes to focus on minimizing one or the other depends on the particular application. For instance, one may focus on reducing the dampening error so as not to miss an extreme event, or on reducing the inflating effect to avoid false alarms. The goal of this study is to improve the estimation of extreme cases by limiting both the dampening and inflating effects as much as possible.

Localization
Localization, defined as a way to help the NN adjust its behavior to local conditions [15], is a powerful tool to improve retrievals [16]. Localization of NNs is a new idea in the remote sensing community. Indeed, the majority of NN models used nowadays in remote sensing are global and standard CNN models use weight-sharing. However, it was shown in [15,16] that using localization techniques improves the NN results by adjusting its behavior to specific local conditions. This is very important for operational retrievals from satellite observations. Localization limits the need for data pooling and will therefore limit the inflating effect. In addition, localization improves the retrieval across all horizontal bins, and should therefore limit the dampening effect too. Boucher et al [15,16] also discuss the advantages and disadvantages of localization.
Several localization approaches exist, both for MLP models and for image-processing techniques like CNNs. A first possibility is to add more information into the NN inputs. These auxiliary variables help the NN better understand the physical conditions at each pixel, by better constraining the RT at that location. The NN is therefore able to better adapt its behavior to varying environments. Another form of auxiliary variables was introduced in [40], in which visible bands and thermal images are combined to propagate high-resolution information for the retrieval of sea surface temperature (SST). In this study, we solely use surface emissivities as a localization variable [18, see section 2.1.3]. This approach is possible for both MLP and CNN models.
Alternatively, it is possible to use independent MLP models for every pixel of the domain. This way, the models do not have to make any compromise since they focus only on the input-to-output relationship over one particular pixel. By localizing using pixel-based NNs, less information is required to fully describe the RT on each location: some auxiliary variables might be omitted, in particular if they are constant on each pixel.
Another form of localization consists in a modified version of CNNs, the Localized-CNN, in which the convolutional layer is replaced with a locally connected layer: specialized spatial filters are derived for every pixel of the domain instead of using the same filters everywhere (weight sharing). Chen et al [41] introduce this layer (that we will call a local convolution) as an intermediate solution between a fully connected layer and a classic convolutional layer. Using such a layer for remote sensing applications was first introduced in [15,16] and proved to be a successful form of localization to improve retrieval statistics. In the following, three retrieval methods with different localization strategies will be compared: a Unique MLP trained on the pooled data from all pixels of the domain, Independent MLPs trained separately for each pixel, and the Localized-CNN. For the Localized-CNN, a single locally connected layer with one kernel is enough to capture the spatial dependencies present in the full image, thanks to the highly localized property of the layer; this results in a unique feature map. The Adam optimization algorithm [42] is used for its training.
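The forward pass of such a locally connected layer can be sketched in numpy to make the contrast with a shared-kernel convolution explicit. This is a minimal single-output-channel illustration (real implementations also include a bias term and an activation function); the function name is hypothetical.

```python
import numpy as np

def local_conv2d(x, weights):
    """Forward pass of a locally connected layer.

    As in a convolution, each output pixel sees a k x k neighborhood of
    the input, but every pixel has its OWN filter instead of one kernel
    shared across the image (no weight sharing).

    x:       (H, W, C) input image (e.g. lat x lon x PCs)
    weights: (H, W, k, k, C), one k x k x C filter per output pixel
    """
    h, w, c = x.shape
    k = weights.shape[2]
    p = k // 2
    xp = np.pad(x, ((p, p), (p, p), (0, 0)))           # zero-pad borders
    out = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            patch = xp[i:i + k, j:j + k, :]            # k x k neighborhood
            out[i, j] = np.sum(patch * weights[i, j])  # pixel-specific filter
    return out
```

A classic convolution is recovered as the special case where weights[i, j] is identical for all (i, j); the locally connected layer relaxes exactly that constraint.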
Training hyper-parameters for each model are shown in table 1. Note that different optimization algorithms are used because, in the case of the Unique MLP, the problem is more complex: many different local conditions are pooled into one large database, and optimizing the MLP model with the Adam algorithm did not lead to satisfactory results. The Levenberg-Marquardt algorithm is more efficient in this case. However, this algorithm was not available to train CNNs in Python, so we opted for the well-known Adam algorithm there. Furthermore, all models are trained with separate training and validation sets, and early stopping is used to ensure that no overfitting occurs. The training time of the Unique MLP model is 5 min. Each Independent MLP model takes 1-2 s to train, depending on the pixel location. Lastly, the Localized-CNN model takes 8 min to train.

Application to LST retrieval
Three diagnostics will be used in the following to quantify the two effects: (a) the percentage of time in which we observe a net dampening and inflating, (b) the average amount of dampening and inflating (expressed in K), and (c) the mean absolute error (MAE) that takes into account both types of errors. The goal is to improve the estimation of extreme cases by limiting both the dampening and inflating effects.
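Given the per-pixel range differences ∆_R, the three diagnostics can be computed as follows (an illustrative sketch; the same logic applies to the extreme lows and highs, and the function name is hypothetical):

```python
import numpy as np

def dampening_inflating_diagnostics(delta_r):
    """Summarise per-pixel range differences Delta_R = R_RET - R_ERA5.

    Returns (a) the fraction of pixels with net dampening / inflating,
    (b) the average dampening and inflating amounts (in K), and
    (c) the MAE, which weights both error types equally.
    """
    delta_r = np.asarray(delta_r, dtype=float)
    damp = delta_r[delta_r < 0]   # net dampening
    infl = delta_r[delta_r > 0]   # net inflating
    return {
        "pct_dampening": 100.0 * damp.size / delta_r.size,
        "pct_inflating": 100.0 * infl.size / delta_r.size,
        "avg_dampening": -damp.mean() if damp.size else 0.0,
        "avg_inflating": infl.mean() if infl.size else 0.0,
        "mae": np.abs(delta_r).mean(),
    }
```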

Analysis of the dampening and inflating effects
We show the results in table 2, which includes the diagnostics mentioned above. This table decomposes the error originating from the dampening and inflating effects. Each diagnostic is performed on the extreme lows and highs (the retrieved vs. ERA5 LST values themselves) and on the LST ranges.
It is seen that the more pooling is done in the learning database, the larger the inflating becomes. Indeed, on all sets (i.e. extreme lows, extreme highs, and LST ranges) we observe that the Unique MLP model inflates the most. This is due to the compromise such a model must make. The model has to find similarities between the TB-to-LST relationships of very different pixels. If one pixel has a much wider range than another, this larger range will tend to be present in the retrieval of the pixel whose range was smaller, leading to an inflating effect on the range. The Independent MLPs hardly suffer from the inflating effect: by construction, there is no pooling done for these models. The remaining inflating is just due to stochastic variations in the optimization performed during training. The Localized-CNN model suffers slightly more from the inflating effect in terms of percentage. However, the average inflating that occurs remains very low, below 0.5 K. The dampening effect (naturally) appears to be the most problematic source of error. Ideally, we would like the fraction of dampening to be 100% (i.e. no inflating effect), as this is the normal behavior of a regression model, whilst having the average dampening as close to 0 as possible. The Localized-CNN offers a very satisfying compromise considering this objective. The Localized-CNN drastically reduces the average dampening observed in comparison to the Unique MLP, and even compared to the Independent MLPs (especially on extreme highs).
The MAE considers both types of errors, giving them an equal weight. This diagnostic gives a good overview on the performance of the different models for extreme events. The Localized-CNN demonstrates the best performance.
The table can be translated into a spatial representation in figure 3. The first three rows are maps showing the average dampening and inflating effects on extreme lows (i.e. overestimation in most cases), on extreme highs (i.e. underestimation in most cases) and on LST ranges. This is done for each method and each pixel of the domain. Overall, it is over the Alps and Pyrénées mountains that the largest errors are made, especially for the extreme lows. As mentioned previously, using Independent MLPs as opposed to a Unique MLP helps reduce the inflating and thus the MAE. However, there remains a significant amount of dampening. The Localized-CNN image processing allows the model to utilize the spatial dependencies present in the images. The retrieval in each pixel uses the 5 × 5 neighboring-pixel filters.

Table 2. Summary of the dampening and inflating effects on extreme lows, extreme highs and LST ranges. The extreme lows and highs (S_L and S_H) are respectively defined as the average LSTs on pixels that lie outside the 10th-90th percentiles of the ERA5 LST, and the ranges as the average value in the extreme highs minus the average value in the extreme lows (see section 2.2.2). For each, we show the fraction of the time in which dampening/inflating is observed (in %), the average dampening and inflating (in K), and the MAE that takes into account both types of errors.
Since information is extracted from the neighboring pixels, the model is able to better detect extreme cases. This leads to a drastic reduction of the dampening effect, but re-introduces a slight inflating, mostly in pixels surrounding the coldest pixels (i.e. mountains).
In general terms, a difference in performance of the three methods can be observed between extreme highs and extreme lows. Since summer and winter months are equally represented in the training database, this difference is not due to the methodology, but rather comes from an inherent physical difficulty in retrieving the low extremes. This may be due to cloud contamination, which is more likely to occur in cold situations [43].
Overall, we can see that reducing the pooling in the database, by localizing the retrieval algorithm as much as possible, results in smaller errors for the extreme cases, both by limiting the artificial inflating effect and by minimizing the dampening effect.

Two case studies: retrievals during past cold spells and heatwaves
Looking at past cold spells and heatwaves over the chosen domain is another way to appreciate the quality of extreme value retrievals. It is important to be able to retrieve this type of event from satellite observations, in particular for weather prediction.
A focus is made on the cold spell that hit France in February 2012. This cold spell lasted from 1 to 13 February 2012 and was one of France's strongest of the last 50 years (https://public.wmo.int/en/media/news/cold-spelleurope-late-winter-20112012). Extreme cold temperatures were recorded over most of the country, due to an anticyclonic situation that favored the circulation of icy air. The fourth row of figure 3 shows, for each retrieval scheme, the maps of the average overestimation during the cold spell. We can see that, once again, the more localized the network is, the lower the dampening effect becomes. Localization makes it possible to better adapt the retrieval to each pixel, and the model that uses neighboring pixels performs even better during the cold spell, showing that using spatial patterns helps to better categorize extreme situations. Comparing columns (a) and (c) shows that retrievals during the cold spell are improved by 1 K despite the same input variables being used. Using neighboring pixels that also gravitate towards a cold temperature gives more weight to the resulting value being as cold as expected.
Similarly, another focus is made on the heatwave that France faced at the end of July-beginning of August 2018 (www.thelocal.fr/20180829/summer-2018-frances-second-hottest-july-and-august-onrecord/). The average underestimation of the LST during the heatwave is shown in the last row of figure 3. This heatwave occurred during what was, at the time, the second hottest summer since the beginning of climate records, coming just after the 2003 summer. It was one of the longest heatwaves since the post-war period. Overall, the more localized the NN is, the better it performs on the heatwave retrieval. The MAE is reduced from 1.9 K when using the Unique MLP to 1.1 K when using the Localized-CNN.

Conclusion
A good characterization of extreme events is essential for weather and climate impact studies. The satellite retrieval of extreme events by statistical models is a true challenge. In this study, we introduce two types of errors that occur when estimating extreme events: the dampening and inflating effects. Dampening comes from the fact that, by construction, statistical methods reduce the range of the retrieval, as they can hardly explain 100% of the variance of a database. Conversely, inflating occurs when pooling is done (i.e. mixing several datasets into one; here, the different locations of the domain), so that the behavior over one pixel is transposed onto another pixel of the domain.
This study shows that the novel concept of localizing the retrieval drastically reduces the dampening and inflating effects, and therefore reduces the overall error made by the NNs on extreme events. This was shown both on the LST extreme values themselves and on the LST ranges. We show that using an image-processing approach such as the Localized-CNN has the advantage of limiting the inflating and of drastically reducing the dampening on the ranges of variability, to 1.1 and 0.5 K respectively. This paper focuses on the LST variable over a limited spatial domain only, but perspectives include extending this work to a global-scale retrieval [44,45], to other surface and atmospheric variables, and to future instruments such as IASI-New Generation [46] or the SEVIRI instruments. The dampening effect is inevitable when using statistical models and, with the emergence of artificial intelligence techniques, it is important to understand and limit its effect on the retrieval of extreme events. Monitoring extreme weather events also includes looking at other atmospheric variables, such as volcanic SO2 [47] and CO2 [48], for example for the detection of fires and hurricanes. Applying the presented techniques to such variables has the potential to facilitate the detection of such extreme weather events if assimilated into NWP models [48]. More generally, the method presented throughout this paper is easily transferable to other applications of NN models. Perspectives also include combining localization methods (i.e. image-processing techniques and localization strategies).