Detecting the presence of Tropical Cyclones using Deep Learning Techniques



Introduction
Tropical Cyclones (TCs) are extreme weather events that can leave devastating effects on human populations; for example, Hurricane Irma impacted the Caribbean Islands and the Southeast USA in September 2017, causing 47 direct deaths, 82 indirect deaths, hundreds of injuries and an estimated monetary damage of around 50 billion USD (Cangialosi et al. 2018). TCs can be detected and tracked both in satellite data and in simulations carried out using numerical weather prediction (NWP) or climate models.
Climate models can be used to understand how the properties of TCs and other meteorological phenomena might evolve in a changing climate, but such Global Circulation Models (GCMs) produce a large amount of data. A method to filter this data in order to target analysis would be useful, and such a method is presented here, based on a deep learning model that detects the presence of tropical cyclones in simulation output. While tested offline (i.e. after data has been output), the method is intended for eventual deployment online (i.e. while a simulation model is running) so as to preclude the need to output data which does not include TCs (at least for the situation where TCs are the product of interest). The method presented here is lightweight, with relatively short training and inference times, and requires no explicit a priori thresholds in meteorological variables. It is also shown to perform at least as well as other more standard, and more complex, deep learning models.
Following motivation and a description of previous work, the deep learning model itself is presented, along with a description of the data used for training and validation. The performance of the model is then discussed, both in terms of metrics of success (precision and recall) and in comparison with other deep learning models. A range of techniques were used to attempt to explain the results obtained.

a. Motivation
Data volumes from climate simulations are huge. The current phase of the climate model intercomparison project (Eyring et al. 2016, CMIP6) comprises hundreds of different model simulations, is not yet complete, and could yet produce 18 PB of data (Balaji et al. 2018). Much more data was produced and analysed in the production of the archived datasets. Such data are costly to store and maintain, and the volume makes analysis difficult.
A method of automatically detecting interesting phenomena in such data could have two major benefits: 1. a fast way of finding data suitable for subsequent manual analysis, increasing scientific productivity; and 2. being able to trigger the storing of data during simulation only if the relevant phenomena are detected, reducing the need to store all data periodically and leading to more efficient science (efficient in time saved, storage costs, and storage energy consumption).
The possibility of efficiencies arises because, although many simulations are carried out to target multiple use cases, some are carried out to investigate specific phenomena (e.g. when checking the impact of resolution on simulated TCs, as in Roberts et al. 2015). In these cases, data relating to other phenomena might not be needed. However, currently, in order to be able to retrieve the data for specific phenomena, the simulation stores sufficient data for post-processing analysis at fixed intervals. The first post-processing step then involves the retrieval of only the relevant data; the other data is not used.
To select the correct data for analysis, it is important to have confidence in the method used for identifying the feature of interest. There is sometimes a conflict between the abstract notion of the feature of interest (in this case a TC) and the practical implementation of a definition of a TC; the latter is intimately related to the tool for discovering it. For example, if the practical definition of a TC is the same as the metric for detecting it, of course we have confidence in it, but this definition may miss things which we would abstractly consider to be TCs, or detect phenomena which we would not consider to be TCs but which fall inside a poorly drawn definition. We show some examples of this later. Many previous techniques for TC identification in numerical data conflate the detection method with the definition; with deep learning it is clear that this will not be the case, so understanding the distinction is important.
Although our initial interest is in detecting TCs, the method is expected to be extensible to other important phenomena such as fronts and atmospheric rivers.

Previous Work
Several methods to detect TCs have been developed. Most operate by using thresholds set for a few meteorological variables to determine the presence of a tropical cyclone. The use of thresholds leads to two problems: setting such thresholds involves scientific subjectivity, and the combination of method and threshold may not be transferable across different models or data. More recently deep learning has been used, and while deep learning may suffer from aspects of the transferability problem, it should be possible to avoid subjectivity.

a. TC detection using conventional techniques
Conventional techniques usually work by identifying TC centres by applying various thresholds to the available data. TC tracks are then created by arranging the identified TC centres according to some mathematically-based method. The following are a few examples of such methods, with a tabular summary in Table 1. Kleppek et al. (2008) use multiple thresholds to identify TC centres. The first is that a local minimum of sea-level pressure (SLP) needs to be observed within a neighbourhood of eight grid points; this is assigned as a storm centre. For it to be a TC centre, a maximum relative vorticity at 850hPa above 5 × 10⁻⁵ s⁻¹ positioned at the storm centre needs to be present. Vertical wind shear between 850hPa and 200hPa of at least 10 m s⁻¹ is also required, as well as an event lifetime of 36 or more hours. Finally, if the storm centre is over land, the relative vorticity condition has to be fulfilled or the wind speed maximum at 850hPa needs to be within 250km of the TC centre.
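To make the role of such thresholds concrete, the following is a minimal sketch (not Kleppek et al.'s actual implementation) of how per-variable criteria might be combined at a single grid point; the function name, field layout, neighbourhood test and default threshold values are simplified assumptions for illustration only:

```python
import numpy as np

def is_tc_centre(slp, vort850, wind_shear, i, j,
                 vort_threshold=5e-5, shear_threshold=10.0):
    """Simplified, hypothetical TC-centre test at grid point (i, j)."""
    # Local SLP minimum within the 8 surrounding grid points -> storm centre.
    window = slp[i - 1:i + 2, j - 1:j + 2]
    if slp[i, j] != window.min():
        return False
    # Relative vorticity at 850 hPa must exceed the threshold at the centre.
    if vort850[i, j] <= vort_threshold:
        return False
    # Vertical wind shear (850-200 hPa) of at least 10 m/s must be present.
    if wind_shear[i, j] < shear_threshold:
        return False
    return True
```

Each hard-coded threshold in a sketch like this is a point of scientific subjectivity, and each may need re-tuning for a different model or dataset, which is exactly the transferability problem noted above.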
Similarly, Vitart et al. (1997) used the closest minimum of mean sea level pressure (MSLP) to a local maximum of relative vorticity at 850hPa over 3.5 × 10⁻⁵ s⁻¹ as a storm centre. For it to be a TC centre, the closest local maximum of the average temperature between 550hPa and 200hPa must be within 2° of the storm centre, and the temperature must decrease by at least 0.5°C over at least 8° of latitude in all directions. Also, the closest maximum thickness between 1000hPa and 200hPa must be within 2° of the storm centre, and the thickness must decrease by at least 50 metres over at least 8° of latitude in all directions. Camargo and Zebiak (2002) introduce a detection method that uses vorticity at 850hPa, surface wind speed and a vertically integrated temperature anomaly as variables on which to impose basin-dependent thresholds. A final example is Roberts et al. (2015), who use the method explained by Hodges (1995, 1996, 1999) and Bengtsson et al. (2007), where a TC is identified by a maximum of 850hPa relative vorticity in data which has been spectrally filtered using a T42 filter (i.e. keeping features greater than 250km in scale), with a warm-core check on a T63 grid (keeping features larger than 180km) using the 850hPa, 500hPa, 300hPa and 200hPa levels.

b. TC detection using Deep Learning

One deep learning approach uses a model trained to draw a box around a suspected TC. Given the size of the inputs and the number of kernels used in the convolution layers, the model presented was expected to be time-consuming to train, and it was: an adaptation of this deep learning model was trained using 9622 nodes of 68 cores each with a peak throughput of 15.04 PF/s and reached a sustained throughput of 13.27 PF/s, although the total time to train was not reported (Kurth et al. 2017). The accuracy for this model was specified as the percentage of overlap between the predicted box and the box given as the ground truth (an Intersection over Union, IOU), the latter created using the Toolkit for Extreme Climate Analysis (TECA; Prabhat et al. 2012, 2015). For this model, 24.74% of the predicted boxes had at least a 10% overlap with the ground truth, while 15.53% of the predicted boxes had at least a 50% overlap with the ground truth.

Mudigonda et al. (2017) created a deep learning model which used integrated water vapour (IWV) snapshots and image segmentation techniques to classify whether or not each pixel in an image was part of a TC. It used an adaptation of the Tiramisu model, which applies the DenseNet architecture to semantic segmentation. The labels were created using TECA (Prabhat et al. 2012, 2015) and Otsu's method (Otsu 1979). It was trained and tested on images in which at least 10% of the pixels were not background pixels. An accuracy of 92% was obtained, but it was noted that had the model predicted all pixels as background, the accuracy would have been 98%.

Finally, Liu et al. (2016) used an image made up of 8 different meteorological channels, cropped in such a way that if a TC was present it was centred in the image, and then predicted whether or not the image was one of a TC. The model obtained a 99% accuracy despite being relatively simple, but the pre-processing cropping step involved significant noise reduction, which would have helped obtain good performance.

These three approaches are summarized in Table 2.

Deep Learning Model
This section presents the data used to train the deep learning model, the model architecture, and a summary of the method used to develop it. Full details of the training appear in the appendices.

a. Data
The deep learning model, referred to as TCDetect for the rest of this study, was trained, tested and validated on data extracted from the ERA-Interim reanalysis (Dee et al. 2011), with the validation data used for manual hyperparameter tuning as described in Appendix B and the testing set used for producing the final testing statistics and for interpreting the results produced by the trained model. The training and validation sets used data from the 1st of January 1979 until the 31st of July 2017. Five six-hourly fields were used: mean sea level pressure (MSLP), 10-metre wind speed, vorticity at 850hPa, vorticity at 700hPa and vorticity at 600hPa, each at a spatial resolution of ≈0.7° × 0.7°. Spherical filtering was performed on each field to reduce some of the smaller-scale features, as described in Appendix B.
Each field was further split into eight regions (Figure 1). The regions are loosely based on those used by IBTrACS. The resulting dataset had 450,944 cases, each with dimensions of 86 rows, 114 columns and 5 channels.
Labels for these cases were derived from the International Best Track Archive for Climate Stewardship (IBTrACS) dataset (Knapp et al. 2010, 2018), which contains temporal information, a category, and the latitude and longitude of the storm centre for all major storms across the globe. The labels used were simply the presence or absence of a TC. At the end of the labelling process, 22,826 (5.06%) positive cases were identified, as well as 428,118 (94.94%) negative cases.
Training and validation datasets were extracted by taking data from 1979, 1986, 1991, 1996, 2001, 2006, 2011 and 2016 for the validation set, with the rest of the period used to make up the training set. Splitting the data in this way was done so that the possible effects of a changing climate were taken into consideration; any hyperparameter tuning performed would not be skewed by underlying nonstationarity. The resulting training set had a total of 357,408 cases, with 339,546 (95.00%) not having a TC and 17,862 (5.00%) with a TC. The validation set had a total of 93,504 cases, with 88,651 (94.81%) not having a TC and 4,853 (5.19%) with a TC.
Data from the 1st of August 2017 until the 31st of August 2019 was used as a testing dataset. This had a total of 24,352 cases, with 23,010 (94.49%) not having a TC and 1,342 (5.51%) having a TC present.
Table 3 shows how the splits are made and that the split between positive and negative cases is mostly kept.
All data was preprocessed to reduce the resolution to a sixteenth of the original by taking the mean value of all data points in a 4x4 box, resulting in 22 rows, 29 columns and 5 channels in each case. In order to standardize each value around 0 with a standard deviation of 1, each of the fields used as an input variable for a channel was normalised using (value − μ_field)/σ_field, where μ_field is the mean of the values in the field and σ_field is the standard deviation of the values in the field. Figure 2 shows an example of the preprocessed data for the 28-08-2005-18Z case, the time when Hurricane Katrina obtained its maximum strength.

Fig. 2. An example of the preprocessed data used to train TCDetect. A single case consists of the fields of Mean Sea Level Pressure (MSLP) (top left), 10-metre wind speed (top right), vorticity at 850hPa (middle left), vorticity at 700hPa (middle right) and vorticity at 600hPa (bottom). This example is from 28-08-2005-18Z, i.e. when Hurricane Katrina obtained its maximum strength.

Fig. 3. Visual representation of the architecture of TCDetect. The inputs, having 22 rows, 29 columns and 5 fields, are passed through a 2x2 convolution window whose weights are learnt, producing 8 feature maps but losing one row and column. The resulting feature maps are passed through a MaxPool window which takes the maximum value in its 2x2 window. This further reduces the feature map size by a row and a column, to 20 rows and 27 columns. This is repeated 3 more times, with each step producing more feature maps. These are then combined and reshaped into one long array, which is used as the input to a fully-connected layer of 128 nodes, where all the values in the array are passed into a mathematical calculation which gives out a single value per node. The 128 values obtained are used as inputs to the next layer. This is repeated three times to produce the final inference.
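The preprocessing described above (4x4 mean pooling followed by per-channel standardisation) can be sketched as follows; this is a minimal illustration, not the paper's code, and it assumes the input dimensions are padded to multiples of four (the exact edge handling used for the 86x114 fields is not specified):

```python
import numpy as np

def preprocess(case):
    """Pool a (rows, cols, channels) case over 4x4 boxes, then standardise
    each channel to mean 0 and standard deviation 1.
    Assumes rows and cols are divisible by 4."""
    r, c, ch = case.shape
    # Mean over each 4x4 spatial box: (r, c, ch) -> (r//4, c//4, ch).
    pooled = case.reshape(r // 4, 4, c // 4, 4, ch).mean(axis=(1, 3))
    # Per-channel standardisation: (value - mean(field)) / std(field).
    mean = pooled.mean(axis=(0, 1), keepdims=True)
    std = pooled.std(axis=(0, 1), keepdims=True)
    return (pooled - mean) / std
```

A (88, 116, 5) input yields the (22, 29, 5) case shape used by the model.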

b. Architecture
The architecture of TCDetect uses a convolutional base attached to a fully-connected classifier which outputs a value between 0 and 1, with any value larger than 0.5 signifying that the model detects a TC and any value smaller than or equal to 0.5 meaning that it does not. A more detailed explanation of the model architecture can be found in Appendix A, while a graphical view is shown in Figure 3.
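The architecture as described in the caption of Figure 3 can be sketched in Keras as below. This is a reconstruction from the text, not the authors' code: the filter counts beyond the first block (8), the doubling pattern, the activations and the stride-1 pooling are assumptions consistent with the stated feature-map sizes (see Appendix A for the actual details).

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_tcdetect(input_shape=(22, 29, 5)):
    """Sketch of TCDetect: four conv/pool blocks plus a dense classifier."""
    model = models.Sequential([layers.Input(shape=input_shape)])
    filters = 8
    for _ in range(4):
        # 2x2 convolution with valid padding: loses one row and one column.
        model.add(layers.Conv2D(filters, (2, 2), activation='relu'))
        # 2x2 max pooling with stride 1: loses one more row and column.
        model.add(layers.MaxPool2D((2, 2), strides=1))
        filters *= 2  # assumption: feature maps double each block
    model.add(layers.Flatten())
    for _ in range(3):
        model.add(layers.Dense(128, activation='relu'))
    model.add(layers.Dense(1, activation='sigmoid'))  # P(TC present)
    return model
```

With these choices the first block maps 22x29 inputs to 21x28 after convolution and 20x27 after pooling, matching the sizes quoted in the Figure 3 caption.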
To arrive at the model architecture, manual hyperparameter tuning was used to determine which changes to the architecture performed well (see Appendix B).
Various metrics could have been used. Accuracy, defined as (TP + TN)/(TP + FP + TN + FN) × 100, where TP is the number of true positives, FP the number of false positives, TN the number of true negatives and FN the number of false negatives, was considered, but was not suitable due to the large class imbalance present in our training set: the model would train to produce an inference of "no TC" for all cases, setting TN to a high number and producing a high accuracy, but a model with no skill. Given that detecting the highest possible number of TCs is important for the use of the developed model, recall, defined as TP/(TP + FN), could have been used, as a high value would indicate that the number of false negatives, i.e. undetected TC cases, is small. However, this would not account for the need to also correctly classify non-TC cases. For this, precision, defined as TP/(TP + FP), could be used, as a high value would indicate that the number of false positives, i.e. non-TC cases inferred as TC cases, is small. To get the right balance between the two functions of the model, the Area-under-Curve for the Precision-Recall curve (AUC-PR) was used. This gives a single value which takes both important functions of the model into account. The value can still be somewhat opaque, as the same value could be produced by high precision and low recall, high recall and low precision, or average recall and precision. However, as the model was being developed to identify data for further post-processing, false negatives would be a bigger problem than false positives, and so improvements in recall were favoured over those in precision if AUC-PR varied only marginally or the balance between recall and precision needed to be addressed as a change was assessed.
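The three point metrics above are simple functions of the confusion counts; the following sketch computes them, using the confusion-matrix values reported later for the test set (Table 4) as a worked example:

```python
def accuracy(tp, fp, tn, fn):
    """Fraction of all cases classified correctly, as a percentage."""
    return (tp + tn) / (tp + fp + tn + fn) * 100

def recall(tp, fn):
    """Fraction of actual TC cases that were detected, as a percentage."""
    return tp / (tp + fn) * 100

def precision(tp, fp):
    """Fraction of predicted TC cases that were real, as a percentage."""
    return tp / (tp + fp) * 100

# Worked example with the test-set counts reported in Table 4:
# TP = 1231, FN = 111, FP = 2166, TN = 20844.
print(round(recall(1231, 111), 2))              # 91.73
print(round(precision(1231, 2166), 2))          # 36.24
print(round(accuracy(1231, 2166, 20844, 111), 2))  # 90.65
```

These reproduce the recall, precision and accuracy rates quoted in the results section.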
The bulk of the training was done using a JASMIN (Lawrence et al. 2012) NVIDIA Volta 100 GPU node with 32GB of RAM and 32 CPU cores. The software packages used were Python 3.6.8 and TensorFlow v2.2.0 (Abadi et al. 2015). Manual hyperparameter optimisation was undertaken as described in Appendix B, with 10-fold cross-validation utilised. This meant considerable time and computational resources were dedicated to this process. However, training for the final model needed to traverse the training set only 21 times to converge to a solution, with a total training time of 12 minutes.

Results
The resulting deep learning model, TCDetect, was evaluated against the test set described above. The inferences obtained were also investigated to understand how the model generates its results, with the aim of demystifying the deep learning model. We present these results in this section.

a. Model Statistics
After training, the model correctly classified 1231 (91.73%) of the 1342 cases having a TC and 20,844 (90.59%) of the 23,010 cases not having a TC. It misclassified 111 (8.27%) cases in which a TC was present and 2166 (9.41%) cases in which a TC was not present (Table 4). These results correspond to an accuracy of 90.65%, a recall rate of 91.73% and a precision rate of 36.24%. The recall and precision rates are sufficiently high for the model to be used as a data filtration technique; however, if a different balance between recall and precision were desired, the value which forms the boundary between a positive and a negative prediction (currently 0.5) could be changed (e.g. Figure 4 shows the AUC-PR curve for the model, with the values at each point signifying the boundary at which the corresponding recall and precision rates are obtained).
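The trade-off made by moving the decision boundary can be illustrated with a small sketch; `probs` stands in for the model's sigmoid outputs and the data below is invented for illustration:

```python
import numpy as np

def rates_at_threshold(probs, labels, threshold):
    """Recall and precision when positives are defined as probs > threshold."""
    preds = probs > threshold
    tp = np.sum(preds & (labels == 1))
    fp = np.sum(preds & (labels == 0))
    fn = np.sum(~preds & (labels == 1))
    rec = tp / (tp + fn) if tp + fn else 0.0
    prec = tp / (tp + fp) if tp + fp else 0.0
    return rec, prec
```

Lowering the threshold can only keep or raise recall (fewer missed TCs) while typically lowering precision, which is the direction favoured here given that false negatives matter more for the filtering use case.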

b. Comparison with Standard Models
There are many existing deep learning standard models, so it is reasonable to ask "Would any of those do better than the model developed here?".
To test this, convolutional bases from a rich variety of standard models were compared, including DenseNet121 (Chollet 2017). The convolutional bases, i.e. the parts of the models that learn the patterns needed, were added to the fully-connected classifier developed in this paper. The weights for the convolutional bases obtained when training on the ImageNet dataset (Deng et al. 2009) were used, and the weights in the classifier were trained using the training dataset with the hyperparameters of the presented model.
Given that these convolutional bases required inputs of at least 75 pixels by 75 pixels with 3 channels, some changes to the inputs were required. Firstly, as an input with only 3 channels is required for most of the architectures, the fields retained were those of vorticity at 850hPa, vorticity at 600hPa and MSLP. These fields were deemed the most influential for the model presented in this study by the tests detailed in Section 5a. Secondly, the input size was extended fivefold from 22x29 pixels to 110x145 pixels by interpolating intermediate values.
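The comparison set-up can be sketched as below. This is an illustrative reconstruction, not the paper's code: the classifier head, the frozen base and bilinear interpolation are assumptions; only the channel selection and the fivefold upscaling to 110x145 come from the text.

```python
import tensorflow as tf

def upscale(case):
    """Interpolate a (22, 29, 3) case to (110, 145, 3), a factor of five."""
    return tf.image.resize(case[None], (110, 145), method='bilinear')[0]

def build_comparison_model(weights='imagenet'):
    """A pretrained convolutional base with a dense classifier on top."""
    base = tf.keras.applications.DenseNet121(
        include_top=False, weights=weights, input_shape=(110, 145, 3))
    base.trainable = False  # assumption: only the classifier head is trained
    return tf.keras.Sequential([
        base,
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(128, activation='relu'),
        tf.keras.layers.Dense(128, activation='relu'),
        tf.keras.layers.Dense(128, activation='relu'),
        tf.keras.layers.Dense(1, activation='sigmoid'),
    ])
```

Passing `weights='imagenet'` reproduces the pretrained setting described above; `weights=None` gives the same architecture with random initialisation.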
Of the standard architectures tested (see Figure 5 for all results), none managed to obtain a better AUC-PR value than TCDetect on the test set. TCDetect was also the most certain of its inferences, achieving the lowest loss value, i.e. the average difference between inference and label, on the test set. Table 5 compares the complexity of some of these more standard models and their performance metrics to the model presented here. All of the standard models had far higher complexity in terms of the number of parameters than the one described here, which also outperforms the others in terms of AUC-PR, precision rate and loss. While recall for the model presented here does not outperform all of the other models, it is competitive with the best.

Model Explainability
It is not always obvious how a deep learning algorithm arrives at an outcome given a set of inputs. In this section, the explainability of the results of this deep learning model is addressed, with the aim of understanding when and where it is more or less likely to perform well. Four factors are addressed: feature importance, to determine which inputs influence the inferences most; locality, how well the model represents cyclones regionally; cyclone strength, the impact of strength on detectability; and dataset size, how the amount of training data influences confidence.

a. Feature Importance
One important aspect when building the model was to quantify the relative importance of each input field to the model results. Two methods are employed for this: the Breiman method (Breiman 2001) and the Lakshmanan method (Lakshmanan et al. 2015).
The Breiman method involves randomly permuting the data from one field across all the test cases and then retesting the model with this modified dataset. A decrease in the model's performance is expected, with the most important field producing the largest decrease in performance. The Lakshmanan method involves several steps. Firstly, the data is permuted as in the previous method, one field at a time. Once the field with the most importance is found, i.e. the field which produces the largest decline in performance, it is kept permuted while the other fields are permuted individually. The next most important field is then found by repeating the algorithm on the remaining fields. This process continues until all the fields are permuted.
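The Breiman step can be sketched as follows; `score_fn` stands in for the model's evaluation (e.g. AUC-PR on the test set) and is an assumption, as is the channels-last data layout:

```python
import numpy as np

def breiman_importance(x_test, score_fn, rng=None):
    """Permutation importance: drop in score when each channel (last axis)
    is shuffled across cases. Largest drop = most important channel."""
    rng = rng or np.random.default_rng(0)
    baseline = score_fn(x_test)
    drops = []
    for ch in range(x_test.shape[-1]):
        shuffled = x_test.copy()
        perm = rng.permutation(len(shuffled))
        # Permute this channel across cases, leaving other channels intact.
        shuffled[..., ch] = shuffled[perm][..., ch]
        drops.append(baseline - score_fn(shuffled))
    return np.array(drops)
```

The Lakshmanan variant would wrap this in a loop, keeping the most damaging channel permanently shuffled before re-testing the remainder.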
Both methods were performed 30 times each, and in each case an average was taken to ensure consistent and robust results (Figure 6).
These results suggest this deep learning model prioritises high vorticity at 850hPa (and the associated deep convection), with the other inputs roughly equally weighted.

b. The importance of locality
During development of TCDetect, a number of optimisations were carried out (Appendix B), and due to time and computational constraints, the manual hyperparameter tuning process was performed on data from the Western Atlantic and Western Pacific (WAWP) regions. When doing this, two assumptions were made: that any change made to the architecture which caused an improvement in model performance would result in a similar improvement when the architecture was trained and tested on data from all regions; and that a model trained on data from the WAWP regions would generalise well when tested on data from all regions of the world. The first and third columns of Table 6 show that the first assumption holds, although the magnitude of the improvements can vary between the two models. Also, as shown in Table 7, the architecture has similar performance when trained and tested only on data from the WAWP regions and when trained and tested on data from all regions.
However, the second assumption was found to not hold.The first and second columns of Table 6 show that a model trained on data from the WAWP regions decreased in performance considerably when validated on data from all regions.This is mirrored when using the final models, as shown in Table 7.
A plausible reason for this is that the data from the WAWP regions is not representative of all regions. A comparison of the mean training inputs between the WAWP regions and the whole world (Figure 7) shows significant differences; hence the model trained only on WAWP data might be trying to find a different pattern than that trained on data from all regions.
To further understand how the model trained on WAWP data differs from that trained on data from all regions, the results have been split by basin to examine differences (Table 8). As expected, the model trained on WAWP data performs best in the Western Atlantic and Western Pacific regions, with recall rates of 90.80% and 90.75% respectively. It also performs well in the Eastern Pacific region, with a recall of 80.15%. However, all other regions do not surpass a recall rate of 60%.
When the model trained on data from all regions is used, all recall rates improve, some significantly. The most improved region is the South Western Pacific, where the recall rate increases by more than half, from 58% to 93%. All but one region obtained a recall rate of at least 80%, with many surpassing 90%. The only region that did not do well was the South Atlantic, which obtained a recall rate of 42%. There are only a few TCs (26) for this region, so no conclusions as to why that might be are drawn.

c. Performance by Strength of Tropical Cyclone
A cursory investigation of incorrect classifications indicated that stronger tropical cyclones are picked up better, and so recall as a function of tropical cyclone category was investigated. Not surprisingly, the recall rate improves with cyclone strength as indicated by the Saffir-Simpson scale (Table 9). All of the cases with a Category 5 (strongest) cyclone present were identified.
The question then arises: could the false positives (TCDetect says a cyclone is present, IBTrACS says no) be related to detecting an incipient (or decaying) cyclone which has yet to reach (or has passed) TC strength? For the intended use case (as a filtering method), such false positives would not be a negative outcome.
TCDetect labelled 3397 cases in the testing dataset as having a TC present, while IBTrACS declared 1342 cases with a TC present. Of the remaining cases (technically false positives), IBTrACS information suggests only 506 represented a completely meteorologically inappropriate outcome (no meteorological system present). The complete breakdown of these cases is shown in Table 9. It is clear that TCDetect is picking up the required pattern but is mislabelling weaker features as TCs. For an atmospheric dynamics use case, most of the false positives are actually of practical use, and so in this case a low-precision model is not a bad outcome. In fact, arguably, for this use the "practically useful precision" could be calculated as 2891/(2891 + 506) = 85%. The question as to whether better results might have been obtained by training with the presence or absence of meteorological events as labels, rather than tropical cyclones, is deferred to further work.
In any discussion of the relationship between event strength and detectability, it is important to consider both the definitions leading to the label "Tropical Cyclone" and the data used for detection. The standard definition of a TC, that of having sustained winds of 119 km/h, represents a real-world measurement. It is known that TC intensities are under-represented in reanalyses compared to observations (Hodges et al. 2017), and so there will necessarily be an under-representation of cyclone strength in the input data. This might have been exacerbated by pre-processing the input data to a sixteenth of ERA-Interim's original resolution, introducing a further reduction in the resolvable gradients. However, the amount of preprocessing was chosen during the manual hyperparameter search because that resolution gave the best metrics of success, suggesting that the pattern matching was more important than the values. This is consistent with the a priori assumption that an advantage of deep learning over threshold techniques is that the boundary values are less important. Nonetheless, there must have been significant distortion of lower-strength storms, and that may have contributed to the fall-off in recall with falling event strength.

d. Size of Dataset
One could ask whether a larger training dataset would improve results. This was investigated by re-training the same architecture using varying amounts of data from the available training dataset, in steps of 10% from 10% to 100% of the training set. The results using global training data (Figure 8) are noisy, but an increasing trend can be seen in the AUC-PR plot. However, the rate of increase is very small. This, together with a seemingly constant test loss value, shows that while more data may improve the performance of the model, the improvement would only be incremental.
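The experiment above can be sketched as a simple loop over training-set fractions; `train_and_score` is a stand-in for a full training-and-evaluation run (returning, say, test AUC-PR) and is an assumption:

```python
def data_volume_curve(x_train, y_train, train_and_score, fractions=None):
    """Retrain on growing prefixes of the training set and record the score."""
    fractions = fractions or [f / 10 for f in range(1, 11)]  # 10% .. 100%
    scores = []
    for frac in fractions:
        n = int(len(x_train) * frac)
        scores.append(train_and_score(x_train[:n], y_train[:n]))
    return fractions, scores
```

Plotting `scores` against `fractions` gives curves like those in Figure 8; in practice one would shuffle before taking prefixes so each subset is representative.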

Summary
A deep learning method to identify the presence or absence of Tropical Cyclones in simulation data, referred to as TCDetect, is presented. Trained on ERA-Interim data, TCDetect obtained an AUC-PR of 0.7173, with a recall rate of 92% and a precision rate of 36% on a test set made up of 24,352 cases.
TCDetect was also shown not to generalise well when trained on cases from the Western Pacific and Western Atlantic basins and tested on cases from the whole domain.
As well as presenting the specific optimisations made to obtain the model (including dropout, early stopping, dataset balancing and different inputs), a selection of standard deep learning models are described and shown not to be able to outperform TCDetect, while being more complex.
The possibility of obtaining a better model had more data been available was investigated, with some indications that more data could have helped.
While the training data was obtained from ERA-Interim, the ground truth used was IBTrACS, which introduces an element of uncertainty in interpreting the results: is an incorrect label (presence/absence) a consequence of the presence or absence of an accurate representation of the TC in the ERA-Interim data? It is known that reanalysis data cannot resolve the full strength of storms, and so will likely undercount TCs, and hence depress the possible accuracy rates. We discuss the impact of such uncertainties in a companion paper (Galea and Lawrence 2021).
The impact of such issues on detectability is consistent with the result that TCDetect is better at detecting stronger TCs than weaker TCs (weaker TCs will be more poorly represented in reanalysis data).
Future work includes attempting to improve the deep learning model itself to better handle TCs of a low category, potentially via ideas imported from other standard techniques, as well as implementing an inference step using a version of the model in a full General Circulation Model to evaluate the pros and cons of avoiding data output.

a. Inputs

The input combinations trialled were:
• MSLP, 10-metre wind speed and vorticity at 850hPa, 700hPa and 600hPa
• MSLP and 10-metre wind speed with spherical harmonic filtering between wave numbers 5 and 106
• MSLP and 10-metre wind speed with spherical harmonic filtering between wave numbers 5 and 106, and vorticity at 850hPa, 700hPa and 600hPa with spherical harmonic filtering between wave numbers 1 and 63

The last option provided the best mean AUC-PR, that of 0.5309.

b. Early Stopping
Next, it was noted that the model was overfitting: Figure A1 shows that, except for the first two epochs, the training loss gets smaller while the validation loss gets larger with an increasing number of epochs. Figure A2 shows similar behaviour with AUC-PR.
To overcome this issue, model training was stopped early, when the training and validation AUC-PR started to diverge. A number of epochs of patience, i.e. the number of epochs to wait before stopping to make sure that training was not stopped too early, were trialled to get the best possible performance. Patience values of 2, 5, 10 and 20 epochs were trialled; that of 10 epochs obtained the best mean AUC-PR of 0.6788.
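The patience rule above (equivalent to Keras's `EarlyStopping` callback monitoring validation AUC-PR) can be sketched in plain Python; the function name and the 0-indexed epoch convention are illustrative choices:

```python
def stop_epoch(val_aucpr_per_epoch, patience=10):
    """Return the (0-indexed) epoch at which training would stop, i.e. when
    the validation metric has not improved for `patience` epochs, or None
    if training runs to completion."""
    best, best_epoch = float('-inf'), 0
    for epoch, value in enumerate(val_aucpr_per_epoch):
        if value > best:
            best, best_epoch = value, epoch  # new best: reset the wait
        elif epoch - best_epoch >= patience:
            return epoch  # patience exhausted
    return None
```

A larger patience tolerates noisy validation curves at the cost of extra epochs, which is the trade-off behind trialling values of 2, 5, 10 and 20.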

c. Normalisation
A few methods for normalisation were trialled: normalising values to lie in the range 0 to 1 or -1 to 1, standardising values to have a mean of 0 and a standard deviation of 1, and a combination of normalisation and standardisation. Standardisation produced the best model performance, with a mean AUC-PR of 0.7404.

d. Resolution
The resolution of the data was checked next. The resolution used up to this stage was that of the original ERA-Interim dataset, but resolutions of 1.

e. Dataset Balancing
One problem known when starting hyperparameter optimisation was that the dataset was heavily dominated by negatively labelled cases. In fact, the training dataset with data from the Western Atlantic and Western Pacific (WAWP) regions had 89.46% of cases negatively labelled, while that with data from all regions had 95% of cases negatively labelled. This imbalance would inhibit the model from learning the right patterns to maximise its performance. Therefore, six ways of balancing the dataset were investigated.
• Naive Oversampling - Making copies of the positively labelled cases until the dataset is balanced.
• Undersampling without Replacement - Undersampling the negatively labelled cases prior to training; therefore, some data is never used.
• Undersampling with Replacement - Undersampling the negatively labelled cases during training, so that the cases seen change from epoch to epoch; this risks overfitting on the positively labelled cases.
• Weighting the Cases - Weighting the cases so that the negatively labelled cases have less influence on the learning process.
• Adding Bias - Adding a bias to the output layer so that the model does not have to learn the class imbalance.
• Weighting the Cases and Adding Bias - A combination of the previous two options.
The option that produced the best performance, with a mean AUC-PR of 0.7839, was undersampling with replacement. It can be noted that the model's performance decreased marginally from the previous step, but this option was still selected because it strongly favoured recall, which is important for the use case in mind.
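The selected balancing scheme can be sketched as follows: each epoch draws a fresh random subset of the negative cases, equal in size to the positive set, so the negatives seen change from epoch to epoch. This is a minimal illustration with hypothetical variable names, not the TCDetect data loader:

```python
import random

def epoch_sample(positives, negatives, rng=random):
    """Build one balanced training epoch by undersampling the negatives.

    A fresh random subset of negatives, equal in size to the positive set,
    is drawn each call, so repeated calls (one per epoch) expose the model
    to different negative cases over training."""
    sampled_neg = rng.sample(negatives, len(positives))
    epoch = positives + sampled_neg
    rng.shuffle(epoch)  # mix labels within the epoch
    return epoch
```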

f. Loss and Optimiser
The model so far used the binary cross-entropy loss function with the Stochastic Gradient Descent (SGD) optimiser. All possible combinations of the mean absolute error, mean squared error and binary cross-entropy loss functions with the SGD, RMSprop, SGD with Momentum (momentum parameter of 0.9), Adam, Adagrad, Adamax and Nadam optimisers were examined.
The combination which obtained the best mean AUC-PR, 0.7890, was binary cross-entropy loss with the SGD optimiser using a momentum parameter of 0.9.
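For reference, the selected loss and update rule can be written out for a single prediction and a single scalar weight (an illustrative sketch of the standard definitions, not the framework implementation):

```python
import math

def bce(y_true, p_pred, eps=1e-7):
    """Binary cross-entropy for one prediction, clipped away from 0 and 1."""
    p = min(max(p_pred, eps), 1.0 - eps)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))

def sgd_momentum_step(w, grad, velocity, lr=0.01, momentum=0.9):
    """One SGD-with-momentum update: the velocity accumulates a decaying
    sum of past gradients, smoothing the descent direction."""
    velocity = momentum * velocity - lr * grad
    return w + velocity, velocity
```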

g. Learning Rate and Momentum
A grid search for the best learning rate and momentum parameters was performed. The learning rate values trialled were 0.0001, 0.0005, 0.001, 0.005, 0.01 and 0.05, while the momentum parameter ranged from 0.1 to 1 in steps of 0.1. The combination which produced the best performing model was a learning rate of 0.01 and a momentum of 0.8.
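The grid search amounts to an exhaustive evaluation over the Cartesian product of the two parameter lists; a minimal sketch, with `evaluate` standing in for a full train-and-validate run (its name and signature are illustrative):

```python
from itertools import product

def grid_search(evaluate, learning_rates, momenta):
    """Evaluate every (learning rate, momentum) pair and return the pair
    with the highest score (e.g. mean validation AUC-PR)."""
    return max(product(learning_rates, momenta),
               key=lambda pair: evaluate(*pair))
```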

h. Data Augmentation Methods
Several techniques were evaluated, including random rolls, rotations, adding random noise, flipping the input data along either the x or y direction, and random cropping. The augmentation rate was set at 50%. The options which obtained a comparable or better mean AUC-PR were rolling the picture along the x-direction, flipping the picture left to right and rotating the image by a random amount. These were all included in the model, and the combined methods produced a mean AUC-PR of 0.7988.
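Two of the retained augmentations, rolling along the x-direction and left-right flipping, can be sketched on a 2-D grid as follows (a pure-Python illustration with hypothetical names; the real pipeline presumably works on array data):

```python
import random

def roll_x(grid, shift):
    """Cyclically roll each row of a 2-D grid along the x-direction."""
    return [row[-shift:] + row[:-shift] for row in grid]

def flip_lr(grid):
    """Flip a 2-D grid left to right."""
    return [row[::-1] for row in grid]

def augment(grid, rate=0.5, rng=random):
    """With probability `rate`, apply one randomly chosen augmentation;
    otherwise return the input unchanged."""
    if rng.random() < rate:
        ops = [flip_lr, lambda g: roll_x(g, rng.randrange(1, len(g[0])))]
        return rng.choice(ops)(grid)
    return grid
```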

i. Data Augmentation Rate
The data augmentation rate was then varied from 0.1 to 1 in steps of 0.1 to find the best possible rate. The best performing model, with a mean AUC-PR of 0.8018, used an augmentation rate of 60%.

j. Dropout Position and Rate
Dropout was investigated next. It was trialled in three places, namely the convolutional base only, the fully-connected classifier only and throughout the model, with dropout rates varying from 10% to 100% in steps of 10%.
The model with the best AUC-PR, that of 0.8104, was that employing dropout with a rate of 10% throughout the model.
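A minimal sketch of dropout in its common inverted form, which zeroes a fraction of activations during training and rescales the survivors so the expected activation is unchanged (whether the model here uses the inverted variant is a framework detail assumed for this illustration):

```python
import random

def dropout(values, rate=0.1, training=True, rng=random):
    """Inverted dropout: during training, zero each activation with
    probability `rate` and scale survivors by 1/(1-rate); at inference
    time, pass activations through unchanged."""
    if not training or rate == 0.0:
        return list(values)
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in values]
```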

k. L2 Normalisation Position and Factor
L2 normalisation was also investigated. As with the previous optimisation, it was trialled in the same three places. The normalisation factors checked were 0.00001, 0.00005, 0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05, 0.1 and 0.5. The best-performing configuration produced a mean AUC-PR of 0.8128.
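The L2 term simply adds a scaled sum of squared weights to the data loss, penalising large weights; a one-line sketch with illustrative names:

```python
def l2_penalised_loss(data_loss, weights, factor=1e-4):
    """Total loss = data loss + factor * sum of squared weights."""
    return data_loss + factor * sum(w * w for w in weights)
```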

l. Batch Size
The final optimisation tested was varying the batch size. Batch sizes of 8, 16, 64, 128, 256, 512, 1024 and 2048 were tested, with the first option (8) producing the best performing model, with a mean AUC-PR of 0.8135.
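Batching is simply a partition of the (shuffled) training set into fixed-size chunks, with the last chunk possibly smaller; a minimal sketch:

```python
def batches(dataset, batch_size=8):
    """Split a dataset into consecutive batches of `batch_size`
    (the final batch may be smaller than the rest)."""
    return [dataset[i:i + batch_size]
            for i in range(0, len(dataset), batch_size)]
```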

m. Others
Other optimisations tested which did not produce an improved model included batch normalisation, varying the number of hidden layers and nodes, and using different weight initialisation methods and activation functions.
Figure captions:
The 8 equal parts of the ERA-Interim data which were used to create the training dataset (the area shaded in white was not used).
Confusion matrix resulting from inference on the testing dataset.
Fig. 4. Precision-Recall curve for the final trained model evaluated on the testing dataset.
Fig. 6. Feature importance using the Breiman (top) and Lakshmanan (bottom) methods for the model using the test dataset. The y-axis is AUC-PR after permutations; the non-permuted model obtained an AUC-PR of 0.7173.
Mean case data originating only from the Western Atlantic and Western Pacific regions (left column) and for all regions (right column). Rows show each of the five input fields.
Test AUC-PR and loss for TCDetect with different percentages of the training set used.
Fig. A1. Loss for the model trained and tested on data from the Western Atlantic and Western Pacific regions before applying early stopping.
Fig. A2. AUC-PR for the model trained and tested on data from the Western Atlantic and Western Pacific regions before applying early stopping.
Racah et al. (2017) created a method where a deep learning model takes in a snapshot of the world (in this case from a CAM5 climate model) with 16 different meteorological channels and creates bounding boxes around the detected TCs. The architecture used was that of an auto-encoder with three smaller networks using the bottleneck layer to

Table captions:
Table 5. Comparison of total parameters used and performance metrics for the model presented in this paper and similar models using more standard convolutional bases (columns: Convolutional Base, Total Parameters, AUC-PR, Recall, Precision).
Test set AUC-PR when performing step-wise manual hyperparameter tuning using data from different regions in the validation dataset.
Evolution of accuracy during model development by basin (columns: South Indian, South Western Pacific, South Eastern Pacific, South Atlantic, North Indian, North Western Pacific, North Eastern Pacific, North Atlantic, number of positive cases; see text for explanation of rows).