Analysing Maximum Monthly Temperatures in South Africa for 45 years Using Functional Data Analysis

This paper uses Functional Data Analysis (FDA) to explore the spatial and temporal variation of monthly maximum temperature at 16 locations in South Africa for the period 1965-2010, at intervals of 5 years. We first represent the data using B-spline basis functions and then register the smoothed temperature curves. Phase-plane plots of the registered curves reveal a constant shifting of energy over the years analysed. We next apply functional Principal Component Analysis (fPCA) to reduce the dimension of the maximum temperature curves by identifying the main modes of variation without loss of relevant information; the first functional principal component mostly explains summer variation, while the second explains winter variation. Finally, we apply functional clustering using K-means to reveal the spatial location of maximum temperature clusters across the country. The clusters were not consistent over the 45 years of data analysed, and the locations within a cluster were not always spatially adjacent. Overall, the analysis shows that maximum temperature clusters have not been static across the country over time. To the best of our knowledge, this is the first in-depth analysis of maximum temperature data for 16 locations in South Africa using various FDA methods.


Introduction
South Africa has raised concerns within the climate change community by sharing its experiences of frequent extreme events (Ziervogel, 2014). These extreme weather episodes, and their consequences, place acute stress on South Africa's water resources, food security, health, infrastructure and biodiversity, which directly and indirectly affect human well-being, particularly that of the most vulnerable.
If we focus only on temperature in South Africa, there remains a need for better insight into extreme temperature patterns in order to understand how annual temperature profiles vary across the country. Note that analysing average monthly temperature does this question little justice, as averaging smooths out the effect of the extreme temperatures that occur during the course of the month.
Furthermore, it is important to be cognisant of the fact that the maximum temperature (and indeed the minimum temperature) does not necessarily occur at the same time each year. Before one can analyse variations of a specific weather measurement (such as maximum temperature), one must therefore align the landmarks (such as annual peaks and valleys) across the years.
The framework under which such analysis can be undertaken is the Functional Data Analysis (FDA) framework (Wang, 2015), in which the monthly data are first smoothed into annual curves, and the smoothed annual curves are then aligned with regard to landmark features of interest, allowing investigation of variations in both phase and amplitude across the years.
In addition, the FDA framework allows the analysis of other features contained within the data such as identification of the principal components of variations, identification of clusters of curves and the analysis of the rates of change of the curves.
In this analysis, we revisit the understanding of maximum temperature variation across South Africa using the FDA framework. We focus specifically on monthly maximum temperature data between 1965 and 2010 from 16 locations spread across South Africa. Below we present in detail the historic and current state of the South African, and indeed the southern African, climate and its impact on the global climate scene, with special focus on literature discussing the impact of extreme temperature variations on the lives of those living in the region, and we motivate why the continued study of temperature data using emerging statistical tools is vital.
We next introduce the FDA methodology in greater detail, as used within the scope of this paper.
We then present the data analysed using various FDA methods and discuss the results. Finally, we present concluding remarks in the context of what the findings may mean for those living in the region.

Methodology
Functional Data Analysis (FDA) is the analysis of data that are in the form of curves (Wang, 2015). The objectives of FDA are similar to those of any standard statistical analysis, namely to investigate patterns of variability in the data, to estimate summary statistics, to build models and to aid the process of inference (Ramsay J. O., 2007). FDA methods start by transforming discrete data into functions by smoothing them over a specific continuum (such as a year).
Mathematically, functional data are derived from a set of observations of a continuous underlying process $x(t)$, observed at times $t_j$. We denote by $y(t_j)$ the observed $x(t_j)$ with a noise component $\varepsilon(t_j)$, where $j = 1, \dots, n$ (Levitin, Introduction to functional data analysis, 2007). Therefore, a single functional observation $x(t)$ is derived from the $n$ pairs $(t_j, y(t_j))$ as follows:
$$y(t_j) = x(t_j) + \varepsilon(t_j),$$
where $x(t_j)$ is a smooth function, $y(t_j)$ is the observation and $\varepsilon(t_j)$ is a measurement error or noise. To transform the discrete data into functional data curves, a smoothing process is used as follows:
$$x(t) = \sum_{k=1}^{K} c_k \phi_k(t),$$
where $\phi_k(t)$ are basis functions and $c_k$ are the coefficients associated with the basis functions (Olorunmaiye, 2016).
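One common way to obtain the coefficients, shown here only as an illustrative assumption since the paper does not state the fitting criterion used, is ordinary least squares. Writing $\Phi$ for the $n \times K$ matrix with entries $\Phi_{jk} = \phi_k(t_j)$ and $\mathbf{y}$ for the vector of observations $y(t_j)$, and assuming $n \ge K$ with $\Phi$ of full column rank:

```latex
\hat{\mathbf{c}}
  = \arg\min_{\mathbf{c}} \sum_{j=1}^{n}
      \Bigl( y(t_j) - \sum_{k=1}^{K} c_k \phi_k(t_j) \Bigr)^{2}
  = (\Phi^{\top} \Phi)^{-1} \Phi^{\top} \mathbf{y},
\qquad \Phi_{jk} = \phi_k(t_j).
```

The fitted curve is then $\hat{x}(t) = \sum_{k=1}^{K} \hat{c}_k \phi_k(t)$. Penalised variants add a roughness penalty to the sum of squares; which variant applies here is not specified in the text.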
A basis is a set of known functions $\phi_k$ that are independent of each other and that can approximate a function by a weighted sum of a sufficiently large number, $K$, of such functions. The amount of smoothness of the function is determined by the number $K$ (Ramsay J. O., 1998). B-splines are piecewise polynomials joined together at interval endpoints, known as knots, and they are defined by the order and degree of the polynomial, where the order is one higher than the degree (e.g., a cubic spline of degree 3 is of order 4) (Levitin, Introduction to functional data analysis, 2005). Thus, the number of basis functions is equal to the number of interior knots plus the order of the spline (Ramsay J. O., 2009).
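The counting rule for B-spline basis functions can be checked with a short sketch. The code below is an illustration in plain Python, not the R workflow used in this paper. It evaluates a B-spline basis with the Cox-de Boor recursion, assuming a "clamped" knot vector in which each boundary knot is repeated as many times as the order. The domain [0, 12], the interior knots at 3, 6 and 9, and the function names are hypothetical choices made for the example.

```python
def clamped_knots(lower, upper, interior, order):
    """Full knot vector with each boundary knot repeated `order` times."""
    return [lower] * order + list(interior) + [upper] * order


def bspline_basis(t, knots, order):
    """Evaluate every B-spline basis function of the given order at t
    using the Cox-de Boor recursion."""
    n_intervals = len(knots) - 1
    # Index of the last non-empty knot interval; it is closed on the right
    # so that the upper end of the domain is covered.
    last = max(i for i in range(n_intervals) if knots[i] < knots[i + 1])
    # Order 1: indicator functions of the knot intervals.
    values = [
        1.0 if (knots[i] <= t < knots[i + 1]) or (i == last and t == knots[-1]) else 0.0
        for i in range(n_intervals)
    ]
    # Raise the order one step at a time.
    for k in range(2, order + 1):
        raised = []
        for i in range(len(knots) - k):
            left_width = knots[i + k - 1] - knots[i]
            right_width = knots[i + k] - knots[i + 1]
            left = (t - knots[i]) / left_width * values[i] if left_width > 0 else 0.0
            right = (knots[i + k] - t) / right_width * values[i + 1] if right_width > 0 else 0.0
            raised.append(left + right)
        values = raised
    return values


# Hypothetical set-up: one year on [0, 12], degree 2 (order 3),
# interior knots at 3, 6 and 9.
order = 3
knots = clamped_knots(0.0, 12.0, [3.0, 6.0, 9.0], order)
basis_values = bspline_basis(4.5, knots, order)
print(len(basis_values))  # number of basis functions: interior knots + order
print(sum(basis_values))  # the basis functions sum to one inside the domain
```

With three interior knots and order 3 the basis has six functions, in line with the rule stated above.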
Most of the curves in a functional space, like the much cited human growth curve data (Ramsay J. O., 2006), exhibit variability both in terms of amplitude (vertical) as well as phase (horizontal). Phase variation contains interesting information in the timing of the curves' important peaks (or troughs). To investigate phase and amplitude variation, the curves need to be aligned using any of the curve registration methods such as landmark registration, continuous registration, and shift registration, which use the time-warping function to transform the domain for each curve (Marron, 2015).
Time-warping is a technique that transforms the domain (e.g., time) of each curve to align certain features of interest. Over an interval $[0, T]$, a time-warping function $h(\cdot)$ must be strictly increasing, as time cannot go backward, and must satisfy the constraints $h(0) = 0$ and $h(T) = T$. In this paper we use continuous registration to align the monthly maximum temperature curves. Continuous registration uses the entire curve, rather than specified features, to align the curves, and uses the time-warping function $h(\cdot)$ to minimise amplitude variation (Ramsay J. O., 2009).
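The constraints on a time-warping function can be illustrated with a minimal sketch in plain Python. The one-parameter exponential family below is an assumption chosen for simplicity; it is not the warping function estimated by continuous registration in this paper. The function names, the parameter w and the example curve are hypothetical.

```python
import math


def warp(t, total, w):
    """A strictly increasing warp of [0, total] with h(0) = 0 and h(total) = total.

    w = 0 gives the identity; other values of w bend time while keeping
    both end points fixed.
    """
    if w == 0:
        return t
    return total * math.expm1(w * t / total) / math.expm1(w)


def register(curve, total, w, grid):
    """Evaluate the warped curve x(h(t)) on a grid of time points."""
    return [curve(warp(t, total, w)) for t in grid]


# Hypothetical annual curve (degrees Celsius) on [0, 12] months.
def example_curve(t):
    return 22.0 + 5.0 * math.cos(2 * math.pi * t / 12)


grid = [m * 0.5 for m in range(25)]  # 0, 0.5, ..., 12
warped_times = [warp(t, 12.0, 0.4) for t in grid]
warped_curve = register(example_curve, 12.0, 0.4, grid)
print(warped_times[0], warped_times[-1])  # end points stay fixed
```

Because the warp is strictly increasing and fixes both end points, the warped curve visits the same values in the same order as the original curve; only their timing changes.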

Derivatives and Phase-Plane Plots
After the smoothing process, the derivatives of the curves can be obtained as follows:
$$D^m x(t) = \sum_{k=1}^{K} c_k D^m \phi_k(t),$$
where $D^m$ denotes the $m$th derivative operator (Ramsay J. O., 2006). As some of the variation in a curve can be explained at the level of certain derivatives, phase-plane plots of velocity (the first derivative) against acceleration (the second derivative) provide valuable insights (Hall, 2009). They also give a graphical representation of the energy within the system, with the amount of energy being related to the height and width of the ellipse traced out by the plot. Specifically, kinetic energy is associated with high velocity and low acceleration, while potential energy is characterised by high acceleration.
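The link between a phase-plane plot and an ellipse can be seen in a small sketch, written in plain Python as an illustration rather than as the code used in this paper. It approximates velocity and acceleration by central finite differences, which is an assumption made to keep the example short; with a basis expansion one would differentiate the basis functions instead. The example curve, a single annual harmonic, is hypothetical.

```python
import math


def phase_plane(curve, times, step=1e-3):
    """Return (velocity, acceleration) pairs of `curve` at the given times,
    approximated by central finite differences."""
    points = []
    for t in times:
        velocity = (curve(t + step) - curve(t - step)) / (2 * step)
        acceleration = (curve(t + step) - 2 * curve(t) + curve(t - step)) / step**2
        points.append((velocity, acceleration))
    return points


# Hypothetical annual temperature curve: mean 22, amplitude 5, period 12 months.
AMPLITUDE = 5.0
OMEGA = 2 * math.pi / 12


def example_curve(t):
    return 22.0 + AMPLITUDE * math.cos(OMEGA * t)


months = [m + 0.5 for m in range(12)]  # mid-points of the 12 months
points = phase_plane(example_curve, months)

# For a single harmonic the points lie on an ellipse whose half-axes are
# AMPLITUDE * OMEGA (velocity) and AMPLITUDE * OMEGA**2 (acceleration).
for velocity, acceleration in points:
    radius = (velocity / (AMPLITUDE * OMEGA)) ** 2 + (acceleration / (AMPLITUDE * OMEGA**2)) ** 2
    print(round(radius, 4))
```

A larger amplitude widens both axes of the ellipse, which is the sense in which the size of the loop reflects the amount of energy in the system. Real temperature curves are not pure harmonics, so their phase-plane plots are only roughly elliptical.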

Functional Principal Component Analysis (fPCA)
Principal Component Analysis (PCA) is a dimension-reduction tool for multivariate data that has been extended to functional space as fPCA (Wang, 2015). fPCA is similar to standard PCA, with the primary difference being that PCA does not account for smoothness and continuity while fPCA does (Hadjipantelis, 2018). fPCA reduces a high-dimensional dataset to a low-dimensional one consisting of a set of uncorrelated components that summarise the features of the original dataset and capture the main modes of variability in the data (Cardot, 2008), using eigenvalues and eigenfunctions. One of the challenges in applying fPCA is selecting the number of components to retain (Hadjipantelis, 2018).
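The role of eigenvalues and eigenfunctions can be sketched in plain Python. The sketch is a simplification: it evaluates the curves on a common grid and extracts the leading eigenvector of the sample covariance matrix by power iteration, whereas fPCA proper works with the smoothed functional objects. It is an illustration and not the code used in this paper. The curves, scores and function names are hypothetical.

```python
import math


def leading_component(curves, n_iter=200):
    """Return (explained_fraction, component) for the main mode of variation
    of curves sampled on a common grid."""
    n, m = len(curves), len(curves[0])
    mean = [sum(c[j] for c in curves) / n for j in range(m)]
    centred = [[c[j] - mean[j] for j in range(m)] for c in curves]
    # Sample covariance between every pair of grid points.
    cov = [[sum(row[a] * row[b] for row in centred) / (n - 1) for b in range(m)]
           for a in range(m)]
    total_variance = sum(cov[j][j] for j in range(m))
    # Power iteration, started from the covariance column with the largest variance.
    start = max(range(m), key=lambda j: cov[j][j])
    v = [cov[a][start] for a in range(m)]
    for _ in range(n_iter):
        w = [sum(cov[a][b] * v[b] for b in range(m)) for a in range(m)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(v[a] * sum(cov[a][b] * v[b] for b in range(m)) for a in range(m))
    return eigenvalue / total_variance, v


# Hypothetical curves: a common mean plus two orthogonal seasonal modes.
months = range(12)
mode_1 = [math.cos(2 * math.pi * m / 12) for m in months]
mode_2 = [math.sin(2 * math.pi * m / 12) for m in months]
scores_1 = [-2.0, -1.0, 0.0, 1.0, 2.0]
scores_2 = [0.5, -1.0, 1.0, -1.0, 0.5]
curves = [[20.0 + s1 * mode_1[m] + s2 * mode_2[m] for m in months]
          for s1, s2 in zip(scores_1, scores_2)]

explained, component = leading_component(curves)
print(explained)  # share of total variance captured by the first component
```

The returned component plays the role of the first eigenfunction, and the explained fraction is its eigenvalue divided by the total variance, which is the quantity reported when stating how much variation a component accounts for.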

Functional Clustering
The purpose of functional clustering is to identify representative curve patterns that are likely generated by a similar process (Zhang, 2014). K-means and hierarchical functional clustering are two popular algorithms for functional cluster analysis. Given a set of functional data $\{x_i(t);\ i = 1, \dots, n\}$, K-means finds a set of cluster means $\{\mu_c;\ c = 1, \dots, L\}$ by minimising the sum of the squared distances between the curves $\{x_i\}$ and the cluster centres $\{\mu_c;\ c = 1, \dots, L\}$ (Wang, 2015).
Hierarchical functional clustering is similar to regular hierarchical clustering and uses either the 'agglomerative algorithm' or the 'divisive algorithm' to group curves into clusters. The agglomerative algorithm in functional space starts by calculating the pairwise distances between the curves to form the distance (dissimilarity) matrix, and then applies an agglomerative criterion (single, complete, or average linkage) (Giraldo, 2012).
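The K-means criterion can be illustrated with a minimal sketch in plain Python. It treats each curve as its values on a common monthly grid and uses the squared Euclidean distance, a simplifying assumption that stands in for a distance between functions. It is not the clustering code used in this paper. The six curves and the initial centres are hypothetical.

```python
import math


def squared_distance(a, b):
    """Squared Euclidean distance between two curves on a common grid."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_curves(curves, initial_centres, n_iter=100):
    """Alternate between assigning curves to the nearest centre and
    recomputing each centre as the pointwise mean of its curves."""
    centres = [list(c) for c in initial_centres]
    labels = None
    for _ in range(n_iter):
        new_labels = [
            min(range(len(centres)), key=lambda c: squared_distance(curve, centres[c]))
            for curve in curves
        ]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for c in range(len(centres)):
            members = [curve for curve, lab in zip(curves, labels) if lab == c]
            if members:  # keep the old centre if a cluster is empty
                centres[c] = [sum(vals) / len(members) for vals in zip(*members)]
    return labels, centres


def seasonal_curve(level, amplitude):
    """Hypothetical annual curve peaking in January (month index 0)."""
    return [level + amplitude * math.cos(2 * math.pi * m / 12) for m in range(12)]


curves = [
    seasonal_curve(28.0, 6.0), seasonal_curve(28.5, 6.0),  # warm, large seasonal swing
    seasonal_curve(22.0, 3.0), seasonal_curve(22.4, 3.0),  # mild, small seasonal swing
    seasonal_curve(18.0, 4.0), seasonal_curve(17.6, 4.0),  # cool
]
labels, centres = kmeans_curves(curves, [curves[0], curves[2], curves[4]])
print(labels)
```

K-means depends on the starting centres; the sketch fixes them for reproducibility, whereas practical implementations usually try several random starts.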
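The agglomerative steps described above can be sketched as follows, in plain Python and under the same simplifying assumption of curves evaluated on a common grid with Euclidean distance. This is an illustration, not the code used in this paper. The six curves and the function names are hypothetical.

```python
import math


def curve_distance(a, b):
    """Euclidean distance between two curves on a common grid."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def agglomerative_clusters(curves, n_clusters, linkage="average"):
    """Merge the two closest clusters repeatedly until n_clusters remain."""
    n = len(curves)
    # Step 1: pairwise distance (dissimilarity) matrix.
    dist = [[curve_distance(curves[i], curves[j]) for j in range(n)] for i in range(n)]
    # Step 2: rule turning curve-to-curve distances into a cluster-to-cluster distance.
    combine = {
        "single": min,
        "complete": max,
        "average": lambda values: sum(values) / len(values),
    }[linkage]
    clusters = [[i] for i in range(n)]  # every curve starts in its own cluster
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = combine([dist[i][j] for i in clusters[a] for j in clusters[b]])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


def seasonal_curve(level, amplitude):
    """Hypothetical annual curve peaking in January (month index 0)."""
    return [level + amplitude * math.cos(2 * math.pi * m / 12) for m in range(12)]


curves = [
    seasonal_curve(28.0, 6.0), seasonal_curve(28.5, 6.0),
    seasonal_curve(22.0, 3.0), seasonal_curve(22.4, 3.0),
    seasonal_curve(18.0, 4.0), seasonal_curve(17.6, 4.0),
]
clusters = agglomerative_clusters(curves, n_clusters=3, linkage="average")
print(clusters)  # lists of curve indices, one list per cluster
```

The three linkage options differ only in how the distance between two clusters is summarised: the smallest, the largest, or the average of the curve-to-curve distances.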

Data
For our analysis, weather data were acquired from the Council for Scientific and Industrial Research (CSIR). The original data comprised temperature, precipitation and relative humidity in NetCDF (Network Common Data Form) format. In this paper we focus only on monthly maximum temperature, which was first converted from the NetCDF file format into a .CSV (Comma-Separated Values) file.
The unit of temperature in the data was kelvin, which was converted to degrees Celsius. The data were analysed at a 5-year interval, specifically 1965, 1970, 1975, 1980, 1985, 1990, 1995, 2000, 2005 and 2010. All analyses in this paper were done in the RStudio environment, using the following R packages: fda, funFEM, fda.usc, and fdasrvf. For the spatial visualisation, we used the QGIS software.

Figure 1
Map of South Africa with locations considered in the paper

Results
Data for 10 years between 1965 and 2010, at intervals of 5 years, from 16 locations were analysed. We first smoothed each of the 10 × 16 = 160 curves, then aligned the curves, and finally applied phase-plane plots, fPCA and functional clustering to the registered curves.
We smoothed the monthly maximum temperature data of the 10 years from all 16 locations using a B-spline basis with 3 knots and degree 2. We then aligned the curves with the target function, using continuous registration, so that the peaks, valleys and crossings occur at the same argument values as those of the target.
In Figure 2 we visualise the unregistered smoothed maximum monthly temperature curves (a) and the corresponding registered smoothed maximum monthly temperature curves (b). In our exploratory investigation of the data using phase-plane plots, we use the average monthly maximum temperature from the registered curves of the 16 locations, for each of the 10 years considered in this paper. The numbers 1 to 12 inside the phase-plane plots represent the months January (1) through December (12).
The phase-plane plots reveal that 1980 and 1985 had similar weather patterns, as did 1990 and 2005. In 1965, positive potential energy is greatest from around August to December, implying a steady increase in temperature. In 1975, temperature decreases from January to May, reaches zero velocity in June, and increases from July to December, with high acceleration from September to December.
In 1980 and 1985, the phase-plane plots suggest similar weather patterns, steadily increasing from January to April and decreasing in May, with larger kinetic energy in November and December.
In both 2000 and 2005 there is large kinetic energy in April and September, implying lower temperature in these months, and larger potential energy in January and November.
In 1995 [...]
We performed fPCA using the fda package on the registered monthly maximum temperature data for the 16 locations, and consider the first two harmonic functions, or functional principal components (fPCs), to explain the variation in the data. We observe that the first two fPCs explain 98.7% of the total variation in the data, as evidenced in Figure 5.
Furthermore, we observe that fPC 1 explains about 70% of the variation (Figure 5(a)), which corresponds to the summer variation, while fPC 2 explains about 29% of the variation (Figure 5(b)), which corresponds to the winter variation. The solid lines in Figure 5 represent the mean curve, while the curves labelled with a '+' or a '-' indicate one standard deviation added to or subtracted from the mean, respectively.
We use the registered monthly maximum temperature curves from the 16 locations and 10 years to group the curves into clusters that have high inter-cluster variability and low intra-cluster variability. We set the number of clusters to 3 when using the K-means and hierarchical algorithms, applied with the funFEM package.
In Figure 6 we present results from the clustering algorithm. When we investigate the clustering output, we observe that in Cluster 1 (black) two locations feature [...]
From Figure 6(a) we observe that the clustered curves have a common shape, with higher temperatures from February to April and from October to December, and lower temperatures from May to September, the lowest being between June and August, which corresponds to the winter season. The mean curve of each cluster in [...]
To obtain greater insight into the clustering, we display our cluster results on the map for each year in this study between 1965 and 2010, at 5-year intervals, in Figures 9 and 10. In Figure 9:
(a) the three locations in the Northern Cape are grouped in one cluster;
(b) the locations close to the Indian Ocean are grouped together with the inland locations, except Thabazimbi;
(c) all locations close to the Indian and Atlantic Oceans are grouped in one cluster together with Dullstroom;
(d) coastal and inland locations are spread across clusters 1 and 2, except Alexander Bay and Cape Town (coastal locations), which are grouped in cluster 3;
(e) inland and coastal locations are again spread across clusters 1 and 2, except Alexander Bay and Cape Town (coastal locations), which are in cluster 3;
(f) the inland locations are grouped together in one cluster, the coastal locations in another, and the Northern Cape locations, which also fall in the coastal area, in the third.
The spatial visualisation reveals that Cape Town and Alexander Bay are always grouped together in one cluster, and that Johannesburg and Pretoria are also always in the same cluster. This suggests that, in all the years analysed, Cape Town and Alexander Bay, and likewise Pretoria and Johannesburg, had similar maximum temperature patterns.

Discussion and Concluding Remarks
In this paper we have investigated the temporal and spatial variation of monthly maximum temperature curves using methods within the FDA framework, specifically phase-plane plots, functional principal component analysis and functional clustering. The phase-plane plots added to the analysis of the registered monthly maximum temperature curves by showing that the energy shifts constantly in most of the years analysed.
The application of fPCA allows us to summarise high-dimensional modes of variation in temperature curves without loss of relevant information. Our analysis using fPCA showed that the monthly maximum temperature curves of 16 locations spread across South Africa display variation over time, and in particular that temperature patterns were more variable during austral summers (December to March) and less variable during austral winters (June to August).
Functional clustering analysis revealed that there are distinct temperature clusters, with some clusters comprising consistent locations across the time period analysed, while other locations seem to have less cluster loyalty. The cluster with higher temperature values contains most of the inland locations, while the clusters with average and lower temperature values contain the coastal locations.
The application of FDA methods to temperature data has shown that more insight can be obtained into the variation of temperature both spatially and temporally. In this work we focused only on temperature data; however, Functional Data Analysis methods can be applied to other weather data to gain a more holistic insight into climate-induced changes over time and space.