The potential of machine learning for weather index insurance

Weather index insurance is an innovative tool for the transfer of risks induced by natural hazards. This paper proposes a methodology that uses machine learning algorithms to identify extreme flood and drought events, with the aim of reducing the basis risk connected to this kind of insurance mechanism. The model types selected for this study were the neural network and the support vector machine, both widely adopted for classification problems, which were built by exploring thousands of possible configurations based on combinations of different model parameters. The models were developed and tested in the Dominican Republic, based on data from multiple sources covering the period between 2000 and 2019. Using rainfall and soil moisture data, the machine learning algorithms provided a strong improvement over logistic regression models, used as a baseline for both hazards. Furthermore, increasing the amount of information provided during the training of the models proved beneficial to their performance, increasing their classification accuracy and confirming the ability of these algorithms to exploit big data, as well as their potential for application within index insurance products.

of the art of machine learning models used to forecast floods, while Hao et al. (2018) and Fung et al. (2019), in their reviews on drought forecasting, give an overview of machine learning tools applied to predict drought indices. Machine learning has also been employed to forecast wind gusts (Sallis et al., 2011), severe hail (Gagne et al., 2017) and excessive rainfall (Nayak and Ghosh, 2013). In contrast, only a minor part of the literature focuses its attention on the identification or classification of events (Nayak and Ghosh (2013), Khalaf et al. (2018) and Alipour et al. (2020) for floods, Richman et al. (2016) for droughts, Kim et al. (2019) for tropical cyclones). However, the classification of events to distinguish between extreme and non-extreme events is essential to support the development of effective parametric risk transfer instruments. In addition, most of the analysed studies deal with a single type of event.
This paper aims to assess the potential of machine learning for weather index insurance. To achieve this, we propose and apply a machine learning methodology that is capable of objectively identifying extreme weather events, namely floods and droughts, in near-real time, using quasi-global gridded climate datasets derived from satellite imagery or from a combination of observations and satellite imagery. The focus of the study is then to address the following research questions:
1. Can machine learning algorithms improve the performance of weather index insurance with respect to traditional approaches?
2. To what extent does the performance of machine learning models improve with the addition of input data?
3. Do the best performing models share similar properties (e.g., use more input data or consistently have similar algorithm features)?
In this study we focus on the detection of two types of weather events with very different features: floods, which are mainly local events that can develop over time scales ranging from a few minutes to days, and droughts, which are creeping phenomena that involve widespread areas and have a slow onset and offset. In addition, floods cause immediate losses (Plate, 2002), while droughts produce non-structural damages and their effects are delayed with respect to the beginning of the event (Wilhite, 2000). Both satellite images and reanalyses are used as input data to show the potential of these instruments when properly designed and managed. Two of the most used machine learning methodologies, the neural network (NN) and the support vector machine (SVM), are applied. With ML models it is not always straightforward to know a priori which model(s) perform(s) better, or which model configuration(s) should be used. Therefore, various model configurations are explored for both NN and SVM and a rigorous evaluation of their performance is accomplished. The best performing configurations are tested to reproduce past extreme events in a case study region.
Section 2 describes the NN and SVM algorithms used in this study and their configurations, the procedure adopted to take into consideration the problem of data imbalance due to the rarity of extreme events, the assessment of the quality of the classifications and the procedure used to select the best performing models and configurations. In addition, an overview of the datasets used is provided. Section 3 provides some insights on the area where the described methodology is applied. Section 4 presents and discusses the most important outcomes for both floods and droughts. Section 5 summarises the main findings of the study, highlighting their implications for the case study and analysing the limitations of the proposed approach, while providing insight on possible future developments.

Methodology
Machine learning is a subset of artificial intelligence whose main purpose is to give computers the possibility to learn, throughout a training process, without being explicitly programmed (Samuel, 1959). It is possible to distinguish machine learning models based on the kind of algorithm that they implement and the type of task that they are required to solve. Algorithms may be divided into two broad groups: those using labelled data (Maini and Sabri, 2017), also known as supervised learning algorithms, and those that during training receive only input data for which the output variables are unknown (Ghahramani, 2004), also called unsupervised learning algorithms. As previously mentioned, in index insurance, payouts are triggered whenever measurable indices exceed predefined thresholds. From a machine learning perspective, this corresponds to an objective classification rule for predicting the occurrence of loss or no loss based on the trigger variable. The rule can be developed using past training sets of hazard and loss data (supervised learning). Conceptually, the development of a parametric trigger should correspond to an informed decision-making process, i.e. a process which, based on data, a priori knowledge and an appropriate modelling framework, can lead to optimal decisions and effective actions. This work aims to leverage the aptitude of machine learning, particularly supervised learning algorithms, to support the decision-making process in the context of parametric risk transfer, applying NN and SVM for the identification of extreme weather events, namely flooding and drought for this particular study.
Consider the occurrence of losses caused by a natural hazard on each time unit t = 1, ..., T over a certain study area G, and let L_t be a binary variable defined as

L_t = 1 if losses occurred over G during time unit t, and L_t = 0 otherwise.

The aim is then to predict the occurrence of losses based on a set of explanatory variables obtained from non-linear transformations of a set of environmental variables. This hybrid approach aims to capture some of the physical processes through which the hazard creates damage, by incorporating a priori expert knowledge on environmental processes and damage-inducing mechanisms for different hazards. Raw environmental variables are not always able to fully describe complex dynamics such as flood-induced damage; therefore, the use of expert knowledge is important to provide the machine learning model with input data that better characterise the natural hazard events.
Supervised learning with machine learning methods, based on physically-motivated transformations of environmental variables, is then used to capture loss occurrence. The models are set up such that they produce probabilistic predictions of loss rather than directly classifying events in a binary manner. This allows the parametric trigger to be optimised in a subsequent step, in a metrics-based, objective and transparent manner, by disentangling the construction of the model from the decision-making regarding the definition of the payout-triggering threshold. Probabilistic outputs are also able to provide informative predictions of loss occurrence that convey uncertainty information, which can be useful for end users when a parametric model is operational (Figueiredo et al., 2018).

Variable and datasets selection
The data-driven nature of ML models implies that the results yielded are only as good as the data provided. Thus, the effectiveness of the methods depends heavily on the choice of the input variables, which should be able to represent the underlying physical process (Bowden et al., 2005). The data selection (and subsequent transformation) therefore requires a certain amount of a priori knowledge of the physical system under study. For the purpose of this work, precipitation and soil moisture were used as input variables for both flood and drought. An excessive amount of rainfall is the initial trigger of any flood event (Barredo, 2007), while scarcity of precipitation is one of the main causes of drought periods (Tate and Gustard, 2000). Soil moisture is instead used as a descriptor of the condition of the soil. With the idea of implementing a tool that can be exploited in the framework of parametric risk financing, we selected the datasets used to retrieve the two variables according to five criteria:
1. Spatial resolution: a fine spatial resolution that takes into account the climatic features of the various areas of the considered country is needed to develop accurate parametric insurance products.
2. Frequency: the selected datasets should be able to match the duration of the extreme event that we need to identify. For example, in the case of floods, which are quick phenomena, daily or hourly frequencies are required.
3. Spatial coverage: global spatial coverage enables the extension of the developed approach to areas different from the case study region.
4. Temporal coverage: since extreme events are rare, a temporal coverage of at least 20 years is considered necessary to allow a correct model calibration.
5. Latency time: a short latency time (i.e., the time delay to obtain the most recent data) is necessary to develop tools capable of identifying extreme events in near-real time.
Based on a comprehensive review of available datasets, we found six rainfall datasets and one soil moisture dataset, comprising 4 layers, matching the above criteria. With respect to the studies analysed in Mosavi et al. (2018), Hao et al. (2018) and Fung et al. (2019), which associated a single dataset with each input variable, here six datasets are associated with a single variable (rainfall).
The use of multiple datasets is able to improve the ability of models in identifying extreme events, as demonstrated for example by Chiang et al. (2007) in the case of flash floods. In addition, single datasets may not perform well; the combination of various datasets produces higher quality estimates (Chen et al., 2019). Two merged satellite-gauge products (the Climate Haz-

Data transformation
The raw environmental variables are subjected to a transformation that depends on the hazard under study and is deemed most appropriate to enhance the performance of the model, as described below.

Flood
Flood damage is not directly caused by rainfall, but by the physical actions of water flowing over and submerging assets usually located on land. As a result, even if floods are triggered by rainfall, a better predictor for the intensity of a flood and the consequent occurrence of damage is warranted. To achieve this, we adopt a variable transformation that emulates, in a simplified manner, the physical processes behind the occurrence of flood damage due to rainfall, based on the approach proposed by Figueiredo et al. (2018), which is now briefly described.
Let X_t(g_j) represent the rainfall amount accumulated over grid cell g_j belonging to G on day t. Potential runoff is first estimated from daily rainfall. This corresponds to the amount of rainwater that is assumed not to infiltrate the soil and thus to remain over the surface, and is given by

R_t(g_j) = max(X_t(g_j) − u, 0),

where u is a constant parameter that represents the daily rate of infiltration.
Overland flow accumulates the excess rainfall over the surface of a hydrological catchment. This process is modelled using a weighted moving time average, which preserves the accumulation effect and allows the contribution of rainfall on previous days to be considered. The moving average is restricted to a three-day period. The potential runoff volume accumulated over cell g_j over days t, t − 1, t − 2 is thus given by

V_t(g_j) = θ_0 R_t(g_j) + θ_1 R_{t−1}(g_j) + θ_2 R_{t−2}(g_j),

where θ_0, θ_1, θ_2 > 0 and θ_0 + θ_1 + θ_2 = 1. Finally, let Y_t be an explanatory variable representing the potential flood intensity for day t, obtained by aggregating the accumulated potential runoff over the study area. The parameters of the transformation are estimated by fitting a logistic regression model to concurrent potential flood intensity and reported occurrences of losses caused by flood events, and maximising the likelihood using a quasi-Newton method.
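As an illustration, the transformation from daily rainfall to accumulated potential runoff can be sketched as follows; the infiltration rate u and the weights θ are hypothetical values chosen for the example, not the calibrated ones.

```python
import numpy as np

def potential_runoff(rain, u):
    """Daily potential runoff: rainfall exceeding the infiltration rate u."""
    return np.maximum(rain - u, 0.0)

def accumulated_runoff(rain, u, thetas=(0.5, 0.3, 0.2)):
    """Three-day weighted moving average of potential runoff.
    thetas are illustrative weights satisfying theta0 + theta1 + theta2 = 1."""
    r = potential_runoff(np.asarray(rain, dtype=float), u)
    t0, t1, t2 = thetas
    out = np.full_like(r, np.nan)          # first two days lack full history
    for t in range(2, len(r)):
        out[t] = t0 * r[t] + t1 * r[t - 1] + t2 * r[t - 2]
    return out

daily_rain = [0.0, 30.0, 80.0, 5.0, 0.0]   # synthetic daily totals (mm)
v = accumulated_runoff(daily_rain, u=10.0)
```

The declining weights give the most recent day the largest contribution while preserving the accumulation effect of the two preceding days.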

Drought
Before being processed by the ML model, rainfall data are used to compute the standardised precipitation index (SPI). The SPI is a commonly used drought index, proposed by Mckee et al. (1993). Based on a comparison between the long-term precipitation record (at a given location for a selected accumulation period) and the observed total precipitation amount (for the same accumulation period), the SPI measures the precipitation anomaly. The long-term precipitation record is fitted to the gamma distribution, whose density function is defined as

g(x) = (1 / (β^α Γ(α))) x^(α−1) e^(−x/β),   for x > 0,

where α and β are respectively the shape factor and the scale factor. The two parameters are estimated using the approximate maximum likelihood solutions

α = (1 / (4A)) (1 + sqrt(1 + 4A/3)),   β = x̄ / α,   with A = ln(x̄) − (Σ ln(x)) / N,

where x̄ is the mean of the observations and N is the number of observations.
The cumulative probability is defined as

G(x) = ∫_0^x g(t) dt.

Since the gamma function is undefined for x = 0 and precipitation can be null, the definition of the cumulative probability is adjusted to take into consideration the probability of a zero:

H(x) = q + (1 − q) G(x),

where q is the probability of a zero. H is then transformed into the standard normal distribution to obtain the SPI value:

SPI = φ^(−1)(H(x)),

where φ is the standard normal cumulative distribution function.
The mean SPI value is therefore zero. Negative values indicate dry anomalies, while positive values indicate wet anomalies. Table 3 reports the drought classification according to the SPI. Conventionally, a drought starts when the SPI falls below -1, and the drought event is considered ongoing until the SPI rises back above 0 (Mckee et al., 1993). The main strengths of the SPI are the fact that the index is standardised, so it can be used to compare different climate regimes, and that it can be computed for various accumulation periods (World Meteorological Organization and Global Water Partnership, 2016). In this study, SPI1, SPI3, SPI6 and SPI12 were computed, where the numeric values in the acronym refer to the period of accumulation in months (e.g., SPI3 indicates the standardised precipitation index computed over a three-month accumulation period). Shorter accumulation periods (1-3 months) are used to detect impacts on soil moisture and on agriculture. Medium accumulation periods (3-6 months) are preferred to identify reduced streamflow, and longer accumulation periods (12-48 months) indicate reduced reservoir levels (European Drought Observatory, 2020).
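A minimal sketch of the SPI computation described above, using SciPy's gamma fit on synthetic accumulated totals; the fitting routine and the sample are illustrative, and an operational implementation would work with the accumulation periods discussed above.

```python
import numpy as np
from scipy import stats

def spi(precip):
    """Illustrative SPI for a record of accumulated precipitation totals.
    Fits a gamma distribution to the non-zero values, mixes in the
    probability of zero precipitation, then maps to a standard normal."""
    x = np.asarray(precip, dtype=float)
    nonzero = x[x > 0]
    q = 1.0 - len(nonzero) / len(x)                  # probability of a zero
    alpha, loc, beta = stats.gamma.fit(nonzero, floc=0)
    G = stats.gamma.cdf(x, alpha, loc=0, scale=beta)  # cumulative probability
    H = q + (1 - q) * G                               # zero-adjusted CDF
    H = np.clip(H, 1e-6, 1 - 1e-6)                    # avoid infinite SPI
    return stats.norm.ppf(H)                          # transform to N(0, 1)

rng = np.random.default_rng(0)
sample = rng.gamma(2.0, 30.0, size=240)               # synthetic monthly totals
values = spi(sample)
```

Because the record is fitted to its own distribution, the resulting SPI values are centred near zero, with negative values flagging the driest months.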

Machine learning algorithms
We now focus on the machine learning algorithms adopted in this work, starting with a short introduction and description of their basic functioning, and next delving into the procedure used to build a large number of models based on the domain of 220 possible configurations for each ML method. Finally, the metrics used to evaluate the models are introduced and the reasoning behind their selection is highlighted.

Neural Network (NN)
Neural networks are a machine learning algorithm composed of nodes (or neurons) that are typically organised into three types of layers: input, hidden and output. Once built, a neural network is used to understand and translate the underlying relationship between a set of input data (represented by the input layer) and the corresponding target (represented by the output layer). In recent years, and with the advent of big data, neural networks have been increasingly used to efficiently solve many real-world problems, related for example to pattern recognition and classification of satellite images (Dreyfus, 2005), where the capacity of this algorithm to handle nonlinearity can be exploited (Stevens and Antiga, 2019). A key problem when applying neural networks is defining the number of hidden layers and hidden nodes. This must usually be done specifically for each application case, as there is no globally agreed-on procedure to derive the ideal configuration of the network architecture (Mas and Flores, 2008). Although different terminology may be used to refer to neural networks depending on their architectures (e.g., Artificial Neural Networks, Deep Neural Networks), in this paper they are addressed simply as neural networks, specifying where needed the number of hidden layers and hidden nodes. A network with n layers computes a chain of transformations

h_1 = f(w_1 x + b_1),   h_2 = f(w_2 h_1 + b_2),   ...,   ŷ = f(w_n h_(n−1) + b_n),

where the output of a layer is the input of the following layer. Each equation is a linear transformation of the input data, multiplied by a weight (w) with the addition of a bias (b), to which a fixed nonlinear function (also called activation function) is applied.
The goal of the learning process is to diminish the difference between the predicted output and the real output. This is attained by minimising a so-called loss function (LF) through the fine tuning of the parameters of the model, the weights. The latter procedure is carried out by an optimiser, whose job is to update the weights of the network based on the error returned by the LF.
The iterative learning process can be summarised by the following steps:
1. Start the network with random weights and biases.
2. Pass the input data through the network and obtain a prediction.
3. Compare the prediction with the real output and compute the LF, which is the function that the learning process is trying to minimise.
4. Backpropagate the error, updating each parameter through an optimiser according to the LF.

5. Iterate the previous steps until the model is properly trained. This is achieved by stopping the training process when either the LF is no longer decreasing or a monitored metric has stopped improving over a set number of iterations.
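The five steps above can be sketched with a minimal single-neuron network trained by gradient descent on synthetic data; the learning rate, stopping tolerance and data are illustrative, not the configuration used in the study.

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(float)       # synthetic binary labels

# Step 1: start with (near-)random weights and bias
w, b = rng.normal(size=2) * 0.01, 0.0
lr = 0.5                                              # illustrative learning rate

for epoch in range(300):
    # Step 2: forward pass -> predicted probability (sigmoid activation)
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    # Step 3: compute the LF (binary cross-entropy)
    loss = -np.mean(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))
    # Step 4: backpropagate the error and update the parameters
    grad_w = X.T @ (p - y) / len(y)
    grad_b = np.mean(p - y)
    w -= lr * grad_w
    b -= lr * grad_b
    # Step 5: stop iterating once the loss is low enough (illustrative criterion)
    if loss < 0.1:
        break

accuracy = np.mean((p > 0.5) == y)
```

A deep network repeats the forward and backward passes through every layer, but the loop structure is identical.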
Specific to the training process, monitoring the training history can provide useful information, as this graphic representation depicts the evolution over time of the LF for both the training and validation sets. Looking at the training history has a twofold purpose: firstly, since training is a minimisation problem, as long as the LF is decreasing the model is still learning, while any plateau or rise would mean that the model is overfitting (or no longer learning from the data). Overfitting is avoided when the LF of the training and validation datasets display the same decreasing trend (Stevens and Antiga, 2019). The monitoring is carried out during the training of the model: storing the values of the training and validation losses at each iteration makes it possible to stop the training as soon as either loss stops decreasing or plateaus over a certain number of iterations. In this work, the neural network model is created and trained using TensorFlow (Abadi et al., 2016). TensorFlow is an open-source machine learning library that was chosen for this work due to its flexibility, its capacity to exploit GPU cards to reduce computational costs, its ability to represent a variety of algorithms and, most importantly, the possibility to carefully evaluate the training of the model.

Support Vector Machine (SVM)
Support vector machine is a supervised learning algorithm used mainly for classification analysis. It constructs a hyperplane (or set of hyperplanes) defining a decision boundary between various data points representing observations in a multidimensional space. The aim is to create a hyperplane that separates the data on either side as homogeneously as possible. Among all possible hyperplanes, the one that creates the greatest separation between classes is selected. The support vectors are the points from each class that are the closest to the hyperplane (Wang, 2005). In parametric trigger modelling, as in many other real-world applications, the relationships between variables are non-linear. A key feature of this technique is its ability to efficiently map the observations into a higher-dimensional space by using the so-called kernel trick. As a result, a non-linear relationship may be transformed into a linear one. A support vector machine can also be used to produce probabilistic predictions. This is achieved by using an appropriate method such as Platt scaling (Platt, 1999), which transforms the classifier's scores into a probability distribution over classes by fitting a logistic regression model to them. In this work, the support vector machine algorithm was implemented using the C-support vector classification formulation (Boser et al., 1992) from the scikit-learn package in Python (Pedregosa et al., 2011).
Given training vectors x_i ∈ R^p, i = 1, ..., l, and a label vector y ∈ {1, −1}^l, this specific formulation aims to solve the following optimisation problem:

min_{ω, b, ξ}  (1/2) ω^T ω + C Σ_{i=1}^{l} ξ_i
subject to  y_i (ω^T Ψ(x_i) + b) ≥ 1 − ξ_i,   ξ_i ≥ 0,   i = 1, ..., l,

where ω and b are adjustable parameters of the function generating the decision boundary, Ψ is a function that projects x_i into a higher-dimensional space, ξ_i is a slack variable and C > 0 is a regularisation parameter, which regulates the margin of the decision boundary, allowing a larger number of misclassifications for lower values of C and fewer misclassifications for higher values of C (Fig. 3).
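A brief sketch of a C-SVC with an RBF kernel and Platt scaling in scikit-learn; the synthetic data and parameter values are illustrative.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
# a circular decision boundary: linearly inseparable in the original space
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1.5).astype(int)

# The RBF kernel maps observations to a higher-dimensional space (kernel trick);
# probability=True enables Platt scaling for probabilistic outputs.
clf = SVC(kernel="rbf", C=1.0, probability=True, random_state=0)
clf.fit(X, y)

proba = clf.predict_proba(X)[:, 1]    # probability of the positive class
acc = clf.score(X, y)
```

The probabilistic output is what allows the payout-triggering threshold to be chosen in a later, separate step.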

Model construction
Below we propose a procedure to assemble the machine learning models, which involves techniques borrowed from the data mining field and a deep understanding of all the components of the algorithms. The main purpose is to identify the actions required to establish a robust model construction chain. Hypothetically, one may create a neural network with an infinite number of layers, or a support vector machine model with infinite values of the C regularisation parameter. Figure 4, a zoom-in of the ML algorithm box of the workflow shown previously in Fig. 1, describes the steps followed to create the best-performing NN and SVM models, from the focus put on the importance of data enhancement to the selection of appropriate evaluation metrics, exploring as many model configurations as possible, while being aware of the several parameters comprising these models and the wide ranges that these parameters can take.

Pre-processing of data

Data preprocessing (DPP) is a vital step in any ML undertaking, as the application of techniques aimed at improving the quality of the data before training leads to improvements in the accuracy of the models (Crone et al., 2006). Moreover, data preprocessing usually results in smaller and more reliable datasets, boosting the efficiency of the ML algorithm (Zhang et al., 2003). The literature presents several operations that can be adopted to transform the data depending on the type of task the model is required to carry out (Huang et al., 2015; Felix and Lee, 2019). In this paper, preprocessing operations were split into four categories: data quality assessment, data partitioning, feature scaling and resampling techniques aimed at dealing with class imbalance. The first three are crucial for the development of a valid model, while the last is required when dealing with the classification of rare events. Data quality assessment was carried out to ensure the validity of the input data, filtering out any anomalous values (e.g., negative values of rainfall).

The partitioning of the dataset into training, validation and testing portions is fundamental to give the model the ability to learn from the data and to avoid a problem often encountered in ML applications: overfitting. This phenomenon takes place when a model starts overlearning from the training dataset, picking up patterns that belong solely to the specific set of data it is training on and that are not representative of the real-world application at hand, making the model unable to generalise to samples outside this specific set of data. To avoid overfitting, one should split the data into at least two parts (McClure, 2017): the training set, upon which the model will learn, and a validation set functioning as a counterpart during the training process, where the losses obtained from the training set and those obtained from the validation set are compared to detect overfitting.
A further step is to set aside a testing set of data that the model has never seen. Evaluating the performance of the model using data that it has never encountered before is an excellent indicator of its ability to generalise. Thus, the splitting of the data is key to the validation of the model. In this work, the training of the NN was carried out by splitting the dataset into three parts: training (60%), validation (15%) and testing (25%) sets. During training, the neural networks used only the training set, evaluating the loss on the validation set at each iteration of the training process. After the training, the performance of the model was evaluated on the testing set that the model had never seen. Concerning the SVMs, k-fold cross-validation (Mosteller and Tukey, 1968) was used to validate the model, with 5 folds created by preserving the percentage of samples of each class; the algorithm was therefore trained on 80% of the data and its performance was evaluated on the remaining 20% that the model had never seen.
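The two partitioning schemes can be sketched with scikit-learn; the synthetic data are illustrative, while the percentages follow the splits described above.

```python
import numpy as np
from sklearn.model_selection import train_test_split, StratifiedKFold

rng = np.random.default_rng(1)
X = rng.normal(size=(400, 3))
y = (rng.random(400) < 0.15).astype(int)          # imbalanced binary labels

# NN split (60/15/25): first carve out the test set, then split the
# remainder into training and validation (0.20 of 75% = 15% overall).
X_tmp, X_test, y_tmp, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_tmp, y_tmp, test_size=0.20, stratify=y_tmp, random_state=0)

# SVM validation: 5-fold stratified cross-validation, each fold
# preserving the class proportions of the full dataset.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_sizes = [len(test_idx) for _, test_idx in skf.split(X, y)]
```

Stratifying both splits matters here because the positive (extreme-event) class is rare and could otherwise vanish from a partition.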
Feature scaling is a procedure aimed at improving the quality of the data by scaling and normalising numeric values, so as to help the ML model handle data that vary in magnitude or unit (Aksoy and Haralick, 2001). The variables are usually rescaled to the [0, 1] or [−1, 1] range, or normalised by subtracting the mean and dividing by the standard deviation.
The scaling is carried out after the splitting of the data and is usually calibrated on the training data; the testing set is then scaled with the mean and variance of the training variables (Massaron and Muller, 2016).
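A short sketch of this calibrate-on-training-only scaling with scikit-learn's StandardScaler, on synthetic data for illustration:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
X_train = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
X_test = rng.normal(loc=5.0, scale=2.0, size=(40, 3))

# The scaler is calibrated on the training data only...
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
# ...and the test set is scaled with the training mean and variance,
# so no information from the test set leaks into the model.
X_test_s = scaler.transform(X_test)
```

Fitting the scaler on the full dataset instead would be a subtle form of data leakage, inflating the apparent test performance.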
Lastly, when undertaking a classification task, particular attention should be paid to addressing class imbalance, which reflects an unequal distribution of classes within a dataset. Imbalance means that the numbers of data points available for the different classes are significantly different; if there are two classes, a balanced dataset would have approximately 50% of the points in each class. For most machine learning techniques, a small imbalance is not a problem, but when the class imbalance is high, e.g., 85% of points in one class and 15% in the other, standard optimisation criteria or performance measures may not be as effective as expected (Garcia et al., 2012). Extreme events are by definition rare; hence, the imbalance existing in the dataset should be addressed. One approach to address imbalance is using resampling techniques such as over-sampling (Ling and Li, 1998) and SMOTE (Chawla et al., 2002). Over-sampling is the process of up-sampling the minority class by randomly duplicating its elements. SMOTE (Synthetic Minority Over-sampling Technique) generates synthetic data in the feature space of the minority class, interpolating between each minority data point and its k nearest neighbours. Another possible approach to address imbalance is weight balancing, which restores balance by altering the way the model "looks" at the under-represented class. Over-sampling, SMOTE and class weight balancing were the resampling techniques deemed most appropriate to the scope of this work, namely identifying events in the minority class.
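Random over-sampling, the simplest of the three techniques, can be sketched in a few lines; the helper below is a hypothetical illustration, not the implementation used in the study.

```python
import numpy as np

def oversample(X, y, random_state=0):
    """Randomly duplicate minority-class rows until the classes are balanced."""
    rng = np.random.default_rng(random_state)
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    deficit = counts.max() - counts.min()          # rows needed to balance
    idx = np.flatnonzero(y == minority)
    extra = rng.choice(idx, size=deficit, replace=True)
    keep = np.concatenate([np.arange(len(y)), extra])
    return X[keep], y[keep]

X = np.arange(20).reshape(10, 2)
y = np.array([0] * 8 + [1] * 2)                    # 80/20 imbalance
X_res, y_res = oversample(X, y)
```

SMOTE differs in that the extra minority rows are interpolated between neighbours rather than duplicated, while class weighting leaves the data untouched and reweights the loss instead.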

Analysis of model configurations
Up to this point, several model characteristics and a considerable number of possible operations aimed at enhancing the data were presented, creating an almost boundless domain of model configurations. In order to explore this domain, multiple key aspects were tested for each ML method. Both methods shared an initial investigation of the sampling technique and of the combination of input datasets to be fed into the models; all the data resampling techniques previously introduced were tested, along with the data in their pristine condition, where the model tries to overcome the class imbalance by itself. All possible combinations of input datasets were tested, starting from one dataset for SVM and from two datasets for NN, up to the maximum number of environmental variables used. This procedure can be used to determine whether the addition of new information is beneficial to the predictive skill of the model, and also to identify which datasets provide the most relevant information.
As previously discussed, these models present a multitude of customisable facets and parameters. For the support vector machine, the regularisation parameter C and the kernel type were the elements chosen as the changing parts of the algorithm. Five different values of C were adopted, starting from a soft margin of the decision boundary and moving towards narrower margins, while three kinds of kernel functions were used to find the separating hyperplane: linear, polynomial and radial. The setup for a neural network is more complex and requires the involvement of more parameters, namely the LF and the optimisers concerning the training process, plus the number of layers and nodes and the activation functions as key building blocks of the model architecture. Each of the aforementioned parameters can be chosen from a wide range of options; moreover, there is no clear indication of the number of hidden layers or hidden nodes that should be used for a given problem. Thus, for the purpose of this study, the intention was to start from what was deemed the "standard" for the classification task for each of these parameters, deviating from these standard criteria towards more niche instances of the parameters, trying to cover as much as possible of the entire domain.
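The resulting SVM search space can be enumerated as a Cartesian product; the specific values below are hypothetical placeholders for the five C values, three kernels and resampling options described in the text:

```python
from itertools import product

# Hypothetical SVM search space mirroring the structure described above:
# five C values (soft to narrow margins) x three kernel functions,
# crossed with the resampling strategies (including no resampling).
C_values = [0.01, 0.1, 1.0, 10.0, 100.0]
kernels = ["linear", "poly", "rbf"]
resampling = ["none", "oversampling", "smote", "class_weight"]

configurations = list(product(C_values, kernels, resampling))
```

Each tuple in `configurations` is then trained and scored with the metrics of the next section, which is how the "thousands of configurations" arise once the input-dataset combinations are also crossed in.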

Evaluation of predictive performance
The evaluation of the predictive performance of the models is fundamental to select the best configuration within the entire realm of possible configurations. A reliable tool to objectively measure the differences between model outputs and observations is the confusion matrix. Table 4 shows a schematic confusion matrix for a binary classification case. When dealing with thousands of configurations and, for each configuration, with an associated range of possible threshold probabilities, it is impracticable to manually check a table or a graph for each model setup. Therefore, a numeric value, also called an evaluation metric, is often employed to synthesise the information provided by the confusion matrix and describe the capability of a model (Hossin and Sulaiman, 2015).
There are basic measures that are obtained from the predictions of the model for a single threshold value (i.e., the value above which an event is considered to occur). These include the precision, sensitivity, specificity and false alarm rate, which take into consideration only one row or column of the confusion matrix, thus overlooking other elements of the matrix (e.g., high precision may be achieved by a model that predicts a high number of false negatives). Nonetheless, they are staples in the evaluation of binary classification, providing insightful information depending on the problem addressed. Accuracy and F1 score, on the other hand, are obtained by considering both directions of the confusion matrix, thus giving a score that incorporates both correct predictions and misclassifications. The accuracy is the ratio of the correct predictions over all the instances of the dataset, and is able to tell how often, overall, a model is correct. The F1 score is the harmonic mean of precision and sensitivity. In its general formulation, derived from Jones and Van Rijsbergen (1976)'s effectiveness measure, one may define an F_β score for any positive real β (Eq. 15):

F_β = (1 + β²) · (precision · sensitivity) / (β² · precision + sensitivity),

where β denotes the relative importance assigned to precision and sensitivity. In the F1 score both are considered to have the same weight. For values of β higher than one, more significance is given to false negatives, while β lower than one puts more attention on false positives.
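The confusion-matrix-based scores above can be computed directly from the four cells of the matrix; a small sketch with a toy prediction vector:

```python
import numpy as np

def binary_metrics(y_true, y_pred, beta=1.0):
    """Derive the main scores from the four cells of the confusion matrix."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    precision = tp / (tp + fp)
    sensitivity = tp / (tp + fn)                      # also called recall
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    f_beta = ((1 + beta**2) * precision * sensitivity
              / (beta**2 * precision + sensitivity))  # Eq. 15
    return precision, sensitivity, accuracy, f_beta

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]               # toy observations
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]               # toy classifications
p, s, a, f1 = binary_metrics(y_true, y_pred)
```

With beta = 1 the function returns the ordinary F1 score; raising beta shifts the penalty towards missed events (false negatives), which matters when a missed payout is costlier than a false alarm.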
The goodness of a model may also be assessed in broader terms with the aid of the Receiver Operating Characteristic (ROC) and Precision-Sensitivity (PS) curves. The ROC curve is widely employed and is obtained by plotting the sensitivity against the false alarm rate over the range of possible trigger thresholds (Krzanowski and Hand, 2009). The PS curve, as the name suggests, is obtained by plotting the precision against the sensitivity over the same range of thresholds. For this work, the threshold spans the range of probabilities between 0 and 1. These methods allow evaluating a model in terms of its overall performance over the range of probabilities, by calculating the so-called area under the curve (AUC). It should be noted that both the ROC curve and the accuracy metric should be used with caution when class imbalance is involved (Saito and Rehmsmeier, 2015), as having a large number of true negatives tends to result in low values of the false alarm rate (or 1 - specificity). Table 5 summarises the metrics described above used in this paper to evaluate model performances.
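The construction of the two curves amounts to sweeping the trigger probability and recording the resulting points; a minimal sketch (illustrative only, with a simple trapezoidal AUC rather than any particular library routine) could look as follows:

```python
import numpy as np

def curve_points(y_true, y_prob, thresholds=None):
    """Sweep the trigger probability and collect the ROC points
    (false alarm rate, sensitivity) and the PS points
    (sensitivity, precision)."""
    y_true = np.asarray(y_true, dtype=bool)
    y_prob = np.asarray(y_prob, dtype=float)
    if thresholds is None:
        thresholds = np.linspace(0.0, 1.0, 101)  # 0.01 increments
    pos, neg = y_true.sum(), (~y_true).sum()
    roc, ps = [], []
    for t in thresholds:
        pred = y_prob >= t
        tp = np.sum(y_true & pred)
        fp = np.sum(~y_true & pred)
        sens = tp / pos if pos else 0.0
        far = fp / neg if neg else 0.0
        prec = tp / (tp + fp) if (tp + fp) else 1.0  # convention: no alarms
        roc.append((far, sens))
        ps.append((sens, prec))
    return np.array(roc), np.array(ps)

def area_under_curve(points):
    """Trapezoidal AUC; points sorted by x, ties broken by y."""
    pts = points[np.lexsort((points[:, 1], points[:, 0]))]
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts[:-1], pts[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area
```

A perfectly separating classifier yields a ROC AUC of 1; note that the PS AUC computed this way is a rough trapezoidal approximation, and interpolation conventions differ between implementations.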
In the context of performance evaluation, it is also relevant to discuss how class imbalance might affect measures that use the true negatives in their computation. Saito and Rehmsmeier (2015) tested several metrics on datasets with varying class imbalance, and showed how accuracy, sensitivity and specificity are insensitive to the class imbalance. This kind of behaviour from a metric can be dangerous and definitely misleading when assessing the performances of a ML algorithm, and might lead to the selection of a poorly designed model (Sun et al., 2009), emphasising the importance of using multiple metrics when analysing model performances. Lastly, once the domain of all configurations was established and the best settings of the ML algorithms were selected based on the highest values of the F1 score and of the area under the PS curve, the predictive performances of the models were compared to those of logistic regression (LR) models. The logistic regression adopted as a baseline takes as input multiple environmental variables, in line with the procedure followed for the ML methods, and uses a logit function (Eq. 6) as link function, neglecting interaction and nonlinear effects among predictors. The logistic regression is a more traditional statistical model whose application to index insurance has recently been proposed, and can be said to already represent in itself an improvement over common practice in the field (Calvet et al., 2017; Figueiredo et al., 2018). Thus, this comparison provides an idea of the overall advantages of using a ML method.
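A baseline of this kind is straightforward to express: a logistic regression with a logit link and purely linear, additive predictors. The sketch below fits such a model by plain gradient descent on the cross entropy; it is an illustration of the model class, not the authors' estimation procedure (which would typically use maximum likelihood via a standard statistical package):

```python
import numpy as np

def fit_logistic(X, y, lr=0.1, epochs=2000):
    """Logistic regression (logit link, no interaction or nonlinear
    terms) fitted by gradient descent on the binary cross entropy."""
    X = np.column_stack([np.ones(len(X)), np.asarray(X)])  # intercept column
    y = np.asarray(y, dtype=float)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-X @ w))     # inverse logit link
        w -= lr * X.T @ (p - y) / len(y)     # gradient of the cross entropy
    return w

def predict_proba(X, w):
    """Event probability for new predictor values."""
    X = np.column_stack([np.ones(len(X)), np.asarray(X)])
    return 1.0 / (1.0 + np.exp(-X @ w))
```

On a toy separable predictor the fitted probabilities increase monotonically with the predictor value, as expected from the monotone logit link.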

Case study
This study adopts the Dominican Republic as its case study. The Dominican Republic is located on the eastern part of the island of Hispaniola, one of the Greater Antilles, in the Caribbean region. Its area is approximately 48,671 km2. The central and western parts of the country are mountainous, while extensive lowlands dominate the southeast (Izzo et al., 2010).
The climate of the Dominican Republic is classified as "tropical rainforest"; however, due to its topography, annual precipitation varies considerably depending on the region. The six considered rainfall datasets (described in Table 1) exhibit considerable differences in average annual precipitation values over the Dominican Republic (Fig. 5); Fig. 6 shows the average soil moisture. The central regions are the wettest, while the driest areas are located on the coast. There are no significant differences among the four soil moisture layers.
Weather-related disasters have a significant impact on the economy of the Dominican Republic. The country is ranked as the 10th most vulnerable in the world and the second in the Caribbean, as per the Climate Risk Index for 1997-2016 report (Eckstein et al., 2017). It has been affected by spatial and temporal changes in precipitation, sea level rise, and increased intensity and frequency of extreme weather events. Climate events such as droughts and floods have had significant impacts on all the sectors of the country's economy, resulting in socio-economic consequences and food insecurity for the country.
According to the International Disaster Database EMDAT (CRED, 2019), over the period from 1960 to present, the most frequent natural disasters were tropical cyclones (45% of the total natural disasters that hit the country), followed by floods (37%). Floods, storms and droughts were the disasters that affected the largest number of people and caused the largest economic losses.
The performances of a ML model are strictly related to the data the algorithm is trained on; hence, the reconstruction of historical events (i.e., the targets), although time-consuming, is paramount to achieve solid results. Therefore, a wide range of text-based documents from multiple sources has been consulted to retrieve information on past floods and droughts that hit the Dominican Republic over the period from 2000 to 2019. International disaster databases, such as the well-known EMDAT,

Desinventar and ReliefWeb have been considered as primary sources. The events reported by these databases have been compared with the ones present in hazard-specific datasets (such as FloodList and the Dartmouth Flood Observatory) and in specific literature (e.g., Herrera and Ault, 2017) to produce a reliable catalogue of historical events.
Only events reported by more than one source were included in the catalogue. Figure 7 shows the past floods and droughts hitting the Dominican Republic over the period from 2000 to 2019. More details on the events can be found in Table A1 (floods) and Table A2 (droughts).

Results and Discussion
The results are presented in this section separating the two types of extreme events investigated, flood and drought. As described in Section 2, both NN and SVM models require the assembling of several components. Table 6 collects the number of model configurations explored, broken down by type of hazard and ML algorithm with their respective parameters. The main differences between the ML model parameters for the two hazards reside in which data are provided to the algorithm and which sampling techniques are adopted. The input dataset combinations were chosen as follows: 1. All the possible combinations from 1 up to 6 rainfall datasets (for neural networks, 2 rainfall datasets were considered the starting point).
2. The remaining combinations are obtained by progressively adding layers of soil moisture to the ensemble of six rainfall datasets.
3. The drought case required the investigation of the SPI over different accumulation periods. One-, three-, six- and twelve-month SPI were used.
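The enumeration described in points 1 and 2 can be sketched with standard combinatorics. The rainfall dataset names below are those listed in the data availability statement; the soil moisture layer names are placeholders for the four ERA5 layers:

```python
from itertools import combinations

rainfall = ["CCS", "CHIRPS", "CMORPH", "GSMaP", "IMERG", "PERSIANN"]
soil_layers = ["SM1", "SM2", "SM3", "SM4"]  # hypothetical layer labels

def input_combinations(min_rain=1):
    """Enumerate the input dataset combinations: every subset of the
    rainfall datasets with at least `min_rain` members, then the six
    rainfall datasets plus progressively more soil moisture layers."""
    combos = []
    for k in range(min_rain, len(rainfall) + 1):
        combos.extend(list(c) for c in combinations(rainfall, k))
    for n in range(1, len(soil_layers) + 1):
        combos.append(rainfall + soil_layers[:n])
    return combos
```

With min_rain = 1 this yields 63 rainfall subsets plus 4 soil moisture extensions (67 input combinations); with min_rain = 2, as for the neural networks, the six single-dataset cases drop out, leaving 61.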
Neural networks and support vector machines alike are able to return predictions (i.e., outputs) as a probability when the activation function allows it (e.g., sigmoid function), enabling the possibility to find an optimal value of probability to assess the quality of the predictions. Therefore, for each hazard, the results are presented by introducing at first the models achieving the highest value of the F1 score for a given configuration and threshold probability (i.e., a point in the ROC or precision-sensitivity space). Secondly, the best performing model configurations over the whole range of probabilities according to the AUC of the precision-sensitivity curve are presented and discussed. The reasoning behind the selection of these metrics is discussed in Section 2.4. As described in the same section, the performances of the ML algorithms are evaluated against those of a logistic regression baseline.

Flooding
The flood case presented a strong challenge from the data point of view. The historical catalogue of events for the case study reported 5516 days with no flood events occurring and 156 days of flood, meaning approximately a 35:1 ratio of no event/event. This strong imbalance required the use of the data augmentation techniques presented in Section 2.3.1. The neural network settings returning the highest F1 score were given by the model using all ten datasets and applying over-sampling to the input data. The network architecture was made up of 9 hidden layers, with the number of nodes for each layer as already described, activated by a ReLU function. The LF adopted was the binary cross entropy and the weight updates were regulated by an Adam optimiser. The highest F1 score for the support vector machine was attained by the unweighted model configuration taking advantage of all ten environmental variables, with a radial basis function as kernel type and a C parameter equal to 500 (i.e., a harder margin). Figure 8 shows the predictions of the two machine learning models and the baseline logistic regression, as well as the observed events. The corresponding evaluation metrics are summarised in Table 7; these refer to results measured on the testing set, therefore never seen by the model. Overall, the two ML methods decisively outperform the logistic regression, with a slightly higher F1 score for the neural network.
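Random over-sampling of the minority class, one of the augmentation options referred to above, can be sketched in a few lines. This is a generic illustration of the technique, not the authors' exact procedure:

```python
import numpy as np

def oversample(X, y, seed=0):
    """Randomly over-sample the minority class (with replacement)
    until the two classes are balanced."""
    rng = np.random.default_rng(seed)
    X, y = np.asarray(X), np.asarray(y)
    minority_label = 1 if (y == 1).sum() < (y == 0).sum() else 0
    minority = y == minority_label
    deficit = (~minority).sum() - minority.sum()   # extra copies needed
    idx = rng.choice(np.flatnonzero(minority), size=deficit, replace=True)
    return np.vstack([X, X[idx]]), np.concatenate([y, y[idx]])
```

Applied to a 35:1 catalogue such as the flood case, the training set would grow until event and no-event days appear in equal number; only the training split should be resampled, never the testing set.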
In panel (a) of Fig. 9, the highest F1 scores by method are reported in the precision-sensitivity space, along with all the points belonging to the top 1% of configurations according to F1 score. The separation between the ML methods and the logistic regression can be appreciated, particularly when looking at the emboldened dots in Fig. 9 representing the highest F1 score for each method. The plot also highlights a denser cloud of orange points in the upper left corner and a denser cloud of red points in the lower right corner, attesting, on average, a higher sensitivity achieved by the NNs and a higher precision by the SVMs. For the neural network, the configuration with the highest area under the PS curve is the one that also contains the highest F1 score, whereas for the support vector machine the optimal configuration shares the same features of the one with the best F1 score, with the exception of a softer decision boundary in the form of C equal to 100. The results reported in Fig. 10 (a) and (b) about the best-performing configurations are a further confirmation of the importance of picking the right compound measurement to evaluate the predictive skill of a model. In fact, according to the metrics using the true negatives in their computation (i.e., specificity, accuracy and ROC), one may think that these models are rather good: this deceitful behaviour arises because such metrics do not scale appropriately even for very poor models. The aim of this work is to correctly identify a flood event rather than being correct when none occurs; hence, overlooking the correct rejections seems reasonable.
Panels (a) and (b) of Fig. 10 shed light on the inadequacy of the ROC curve and the related area under the curve (AUC). On the right are displayed the ROC curves, whilst on the left are the PS curves of the ideal configurations for each method according to the highest AUC. The points in both curves represent a 0.01 increment in the trigger probability. The ROC curve indicates the NN as the worst model, being the closest to the 45° line and having, along with the SVM, a lower AUC with respect to the logistic model. This signal is strongly contradicted by the other metrics and by the precision-sensitivity curve, where the red dots are the closest to the upper-right corner where the perfect model resides. The behaviour of these curves is linked, once again, to the disparity in the classes. Additionally, looking at panel (a), all models are rather distant from the always-positive classifier (i.e., a baseline independent from class distribution, represented by the black hyperbola in bold), which is more appropriate as a baseline to beat than a random classifier (Flach and Kull, 2015).
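The always-positive classifier has sensitivity 1 and precision equal to the event prevalence, so its F1 score follows directly from the class counts; assuming this is the quantity traced by the baseline curve, it can be computed as:

```python
def always_positive_baseline(n_events, n_non_events):
    """F1 score of the classifier that always predicts an event:
    sensitivity is 1 and precision equals the event prevalence,
    giving F1 = 2 * prevalence / (1 + prevalence)."""
    prevalence = n_events / (n_events + n_non_events)
    return 2 * prevalence / (1 + prevalence)
```

For the flood catalogue (156 event days against 5516 non-event days) this baseline is only about 0.05, so even a modest F1 score clears it; for a perfectly balanced dataset it would be 2/3.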
Panels (c) and (d) show the behaviour of the predictions returned by the ML models over the whole range of probabilities. It is noticeable that, although the peak value of the F1 score is very close for both ML methods, the neural network displays steadier predictions of true positives, false positives and false negatives across the range of probabilities.

Drought
The data transformation for drought required the computation of the SPI from the precipitation data. The SPI was computed for different accumulation periods: shorter accumulation periods (1-3 months) detect immediate impacts of drought (on soil moisture and on agriculture), while longer accumulation periods (12 months) indicate reduced streamflow and reservoir levels.
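The accumulation step behind the SPI can be sketched as follows. Note that the operational SPI fits a gamma distribution to the accumulated rainfall and maps its CDF through the standard normal quantile function; the plain z-score used here is a simplification intended only to illustrate the effect of the accumulation window:

```python
import numpy as np

def spi_simplified(monthly_rain, accumulation=6):
    """Simplified SPI: accumulate rainfall over the chosen window
    (e.g., 6 months for SPI6) and standardise the result. A z-score
    replaces the gamma-fit/normal-quantile step of the true SPI."""
    r = np.asarray(monthly_rain, dtype=float)
    # rolling sum over the accumulation period
    acc = np.convolve(r, np.ones(accumulation), mode="valid")
    return (acc - acc.mean()) / acc.std()
```

The output has zero mean and unit standard deviation by construction, so negative values flag drier-than-usual accumulation windows, mirroring how SPI thresholds are read.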

As shown in Tables B1 and B2, models using SPI6 and SPI12 showed the best results and their metric values are close to each other; thus, for brevity and in favour of clarity, only one of the two is reported, namely the SPI over a six-month accumulation period. Contrary to the flood case, the drought historical catalogue of events reported 1283 weeks with no droughts and 696 weeks of drought, with a ratio of around 1.85:1 of no event/event. Albeit the classes are relatively balanced, models with weights assigned were also investigated. The performances of the neural networks and the support vector machines were evaluated, as before, by a set of evaluation metrics and curves and a comparison against a logistic regression. It is important to point out that the SPI is updated at a weekly scale, the same temporal resolution as the predictions, implying that each week counts as an event. Considering the duration of the droughts in our historical catalogue of events (i.e., 17 weeks for the shortest one and 148 weeks for the longest), the temporal resolution adopted is an aspect to keep in mind when analysing the results obtained for these models. The highest F1 score for the drought case was obtained for the NN model using all the datasets with weights for the classes. The network architecture was made up of 8 hidden layers with the relative number of nodes, activated by a ReLU function. The LF adopted was the binary cross entropy and the parameter updates were regulated by an Adam optimiser. Regarding the SVM, the highest F1 score was achieved by the unweighted model using all ten environmental variables, with a radial basis function as kernel type and a C parameter equal to 100. Figure 12 displays the predictions of the three methods against the observed events. In particular, it is possible to appreciate the reduction of false positives provided by the NN and the SVM.
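Class weights of the kind mentioned above are commonly set inversely proportional to the class frequencies (the same heuristic behind, e.g., scikit-learn's "balanced" option); a sketch, not necessarily the authors' exact weighting:

```python
import numpy as np

def inverse_frequency_weights(y):
    """Per-class weights inversely proportional to class frequency:
    weight(c) = n_samples / (n_classes * count(c))."""
    y = np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    return {int(c): len(y) / (len(classes) * k)
            for c, k in zip(classes, counts)}
```

For the 1283:696 drought catalogue this assigns the event class a weight about 1.85 times that of the no-event class, exactly offsetting the mild imbalance.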
The strong improvements brought by the ML algorithms are confirmed by the metrics summarised in Table 8, where the NN and the SVM show high values across all the prediction skill measurements. The NN emerges as the most accurate model, while the SVM is the more precise overall. The implementation of either model should take into account the job that these models are required to take on: if the purpose of the model is to balance the occurrences of false alarms and missed events, the NN is preferable; for a task that requires a stronger focus on the minimisation of false positives (i.e., reducing the number of false alarms), the SVM should be used. Figure 13 highlights the distance between the ML methods and the logistic regression, and echoes what was observed for flood: the points for the NN gravitate towards the area of the plot with higher sensitivity values, while the SVM points tend to stay on the precision side of the plot.
The addition of further datasets is still beneficial to the performances of the ML methods as displayed by Fig. 13, panel (b).

The increasing trend for both ML models starts to slow down from the fourth rainfall dataset onward, which might be due to the redundancy of the rainfall datasets. On the other hand, the addition of the layers of soil moisture improves the performances, especially for the support vector machine, which keeps improving steadily, reaching the highest value of the F1 score when the whole set of information is fed to the model. The best-performing configurations bring the points closer to each other towards the area containing the ideal model, which may indicate a more dependable prediction of the events, as shown in panel (c). In fact, while the two configurations have a high value of the F1 score for a wide range of probabilities, the neural network has steadier predictions of true positives, false positives and false negatives. This behaviour of the neural network could also be linked to the miscalibration of the confidence (i.e., the distance between the probability returned by the model and the ground truth) associated with the predicted probability (Guo et al., 2017). This phenomenon arose with the advent of modern neural networks which, employing several layers (i.e., tens or hundreds) and a multitude of nodes, were able to improve the accuracy of their predictions while worsening the confidence of said predictions. Indeed, a miscalibrated neural network returns a probability that does not reflect the likelihood that the event will occur, reducing it to a mere numeric output produced by the model.
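Miscalibration of this kind can be quantified with the expected calibration error of Guo et al. (2017); the following is a binary-event variant, written as an illustration rather than as the measure used in this study:

```python
import numpy as np

def expected_calibration_error(y_true, y_prob, n_bins=10):
    """Binary-event ECE: bin the predicted probabilities and compare,
    per bin, the mean predicted probability (confidence) with the
    observed event frequency, weighting by the bin population."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (y_prob > lo) & (y_prob <= hi)
        if lo == 0.0:
            in_bin |= y_prob == 0.0   # include exact zeros in the first bin
        if in_bin.any():
            gap = abs(y_prob[in_bin].mean() - y_true[in_bin].mean())
            ece += in_bin.mean() * gap
    return ece
```

A well-calibrated model scores near zero; a model that predicts 0.9 for events that never occur scores 0.9, making the gap between confidence and likelihood explicit.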
The feature breakdown of the top one percent of model configurations, shown in Fig. 15, indicates that the best NN configurations are predominantly the ones using weights for the two classes and the ReLU activation function. Also, a large number of models use a high number of layers, in accordance with the configuration with the highest area under the PS curve. The fact that most of the configurations obtaining the best performances have deeper architectures may be a confirmation of the miscalibration affecting the estimated probabilities. For the SVM models, Fig. 15 denotes a marked component of models using harder margins (i.e., high values of the C parameter) and a radial basis function as kernel.

Conclusions
In this study we developed and implemented a machine learning framework with the aim of improving the identification of extreme events, particularly for parametric insurance. The framework merges a priori knowledge of the underlying physical processes of weather events with the ability of ML methods to efficiently exploit big data, and can be used to support informed decision making regarding the selection of a model and the definition of a trigger threshold. Neural network and support vector machine models were used to classify flood and drought events for the Dominican Republic, using satellite data of environmental variables describing these two types of natural hazard. Model performance was assessed using state-of-the-art evaluation metrics. In this context, we also discussed the importance of using appropriate metrics to evaluate the performances of the models, especially when dealing with extreme events that may have a strong influence on some performance evaluators.
It should be noted that while here we have focused on performance-based evaluation measures, an alternative approach would be to quantify the utility of the predictive systems. By taking into account actual user expenses, and thus specific weights for different model outcomes, a utility-based approach could potentially lead to different decisions regarding model selection and the definition of the trigger threshold (Murphy and Ehrendorfer, 1987; Figueiredo et al., 2018). This aspect is outside the scope of the present article and warrants further research.
The proposed approach involves a preceding data manipulation phase where the data are preprocessed to enhance the performances of the ML methods. A procedure aimed at designing and selecting the best parameters for the models was also introduced. Once trained, the ML algorithms decisively outperformed the logistic regression, here used as a baseline for both hazards. The predictive skill of both NN and SVM improved with increasing information fed to the models; indeed, the best performances were always obtained by models using the maximum amount of data available, hinting at the possibility of introducing additional and more diverse environmental variables to further improve the results. While the ML models performed well for both hazards, the drought case showed exceptionally high values for all the adopted model evaluation metrics. This discrepancy in the results between flood and drought might have several explanations. Indeed, the two hazards behave differently both in time and space. On the one hand, the aggregation at national scale is surely an obstacle for a rather local phenomenon like flood. On the other hand, defining a drought event weekly could be misleading, since droughts are events spanning several months, even years. Going to a higher resolution (e.g., regional scale) and introducing data describing the terrain of the area should enhance the detection of flood events. For the drought case, introducing a threshold for the number of consecutive weeks predicted before considering an event, or treating weekly predictions as a fraction of the overall duration of the event, are extensions of this work that deserve investigation to address the issue of potential overestimation of predictive skill.
Neural networks showed more robustness when compared to support vector machines, showing a higher value of the F1 score for a wide range of parameters. As already mentioned, this insensitivity of the NN to the probability threshold adopted may be traced back to the inability of the model to produce probability estimates that are a fair representation of the likelihood of occurrence of the event. Further developments of neural network models should take into consideration procedures that allow the assessment and the quantification of the confidence calibration of probability estimates.
A preliminary investigation of the characteristics shared by the best-performing models showed that some features are more relevant than others when building the ML model, depending on the type of algorithm and also on the type of hazard. An in-depth study of how the performances of the models change when changing model properties could highlight the most important properties to tune, speeding up the model construction phase and reducing the computational cost of running the algorithms. It is also worth noting that, although this work focuses on the application of neural network and support vector machine models, we expect that comparable results could be obtained using other machine learning algorithms, which calls for further research.

Although several issues raised in this article warrant further research, there is clear potential in the application of machine learning algorithms in the context of weather index insurance. The first reason for this is strictly linked to the performances of the models. Indeed, the capability of these algorithms to reduce basis risk with respect to traditional methods could play a key role in the adoption of parametric insurance in the Dominican context, and more generally for those countries that possess a low level of information about risk. The second aspect, perhaps the most intriguing from the weather index insurance point of view, regards the ability of these algorithms to utilise and improve their performances with a growing amount of information (i.e., an increasing number of input variables). Indeed, the significant advances in data collection and availability observed in the last decades (i.e., improved instruments, more satellite missions, open access to data store services) mean that vast amounts of data are readily and freely available on a daily basis. Being able to rely on global data that are disentangled from the resources of a given territory, both from the point of view of climate data (e.g., lack of rain-gauge networks) and from the point of view of information about past natural disasters, is an important feature of the work presented that makes the proposed approach feasible and appealing for other countries. Furthermore, similar technological improvements might be expected in the further development of machine learning algorithms. The scientific evolution of these models, and the possibility of establishing a pipeline that automatically and objectively trains the algorithms over time with updated and improved data (always allowing the monitoring of the process), are other appealing features of these kinds of models.
In conclusion, the framework presented and the topics discussed in this study provide a scientific basis for the development of robust and operationalisable ML-based parametric risk transfer products.
Data availability. The six rainfall datasets (CCS, CHIRPS, CMORPH, GSMaP, IMERG and PERSIANN) and the soil moisture dataset (ERA5) are freely available at the links cited in the references.
Appendix A: Catalogue of historical events

The following tables report the catalogue of historical events for floods (Table A1) and droughts (Table A2).
Appendix B: Performance of NN and SVM in drought events identification when using different SPI accumulation periods

The following tables report the performances of NN (Table B1) and SVM (Table B2) when using different SPI accumulation periods.