Self-Organizing Maps: A Powerful Tool for the Atmospheric Sciences

Self-organizing maps (SOMs) are a powerful tool used to extract obscure diagnostic information from large datasets. In the context of issues related to threats from greenhouse-gas-induced global climate change, SOMs have recently found their way into atmospheric sciences, as well. In meteorology SOMs provide a means to visualize the complex distribution of synoptic weather patterns over a region of interest (Hewitson and Crane 2002), explore extreme weather and rainfall events (Hong et al. 2005, Zhang et al. 2006, Uotila et al. 2007), classify cloud patterns (Tian et al. 1999, Ambroise et al. 2000) and reveal causes and effects of climate changes projected using global climate models (Lynch et al. 2006; Cassano et al. 2007, Skific et al. 2009a, 2009b).


Introduction
Self-organizing maps (SOMs) are a powerful tool used to extract obscure diagnostic information from large datasets. In the context of issues related to threats from greenhouse-gas-induced global climate change, SOMs have recently found their way into atmospheric sciences, as well. In meteorology SOMs provide a means to visualize the complex distribution of synoptic weather patterns over a region of interest (Hewitson and Crane 2002), explore extreme weather and rainfall events (Hong et al. 2005, Zhang et al. 2006, Uotila et al. 2007), classify cloud patterns (Tian et al. 1999, Ambroise et al. 2000) and reveal causes and effects of climate changes projected using global climate models (Lynch et al. 2006; Cassano et al. 2007, Skific et al. 2009a, 2009b).
The SOM's unsupervised learning algorithm reduces the dimension of large data sets by grouping similar multi-dimensional fields together and organizing them into a two-dimensional array (Kohonen 2001). To a trained operational meteorologist the interpretation of SOMs is intuitive, as they are reminiscent of synoptic charts arranged adjacent to one another according to their similarity (much like tracking a weather system in time, as is done in synoptic meteorology; Hewitson and Crane 2002). Although still largely underutilized, SOMs are gradually becoming more widely used for applications in atmospheric science.
Unlike most traditional clustering algorithms, SOMs attempt to preserve the continuity of the data space, using the information in the provided data. Neighboring clusters therefore resemble one another, because a single data sample contributes to more than one cluster: in each training step the whole neighborhood around the best-matching cluster is updated, not only the best match itself. Where the original data support it, this also yields a more detailed representation of features that appear on neighboring clusters. On the other hand, because the SOM attempts to span a continuous data space, some of the resulting clusters may have only a few members ascribed to them, where the map bridges a gap in the data or a region where data are very sparse.
This chapter provides a brief summary of several experiments using SOMs to explore how Arctic climate will change by the end of the 21st century. It demonstrates how the SOM technique can be adapted to quantify a change in a meteorological variable of interest and possibly reveal the underlying mechanism driving that change.

Data preparation
In this application, the high-dimensional data subjected to SOM analysis are daily fields of sea-level pressure (SLP) anomalies simulated by the Community Climate System Model, version 3 (CCSM3), for the periods 1960 to 1999, 2010 to 2030, and 2070 to 2089. The latter two periods are extracted from a simulation of the "worst-case scenario" of greenhouse gas emissions for the 21st century as specified by the Special Report on Emission Scenarios, SRES A2 (Nakicenovic and Swart 2000). These scenarios are based upon assumptions for future greenhouse gas pollution, land use, global economic development, etc. The SLP fields are then interpolated from the original 1.4° x 1.4° grid to a 200 km x 200 km Equal-Area Scalable Earth (EASE) grid (Armstrong et al. 1997), covering the area north of 60°N and consisting of 51 x 51 grid points. Interpolation to an equal-area grid avoids errors that might occur because the SOM algorithm weights all grid boxes equally, whereas the original latitude-longitude grid boxes decrease in size toward the pole.
Daily SLP anomalies are then derived by subtracting the gridpoint SLP from the domain-averaged SLP for each daily field. The spatial distribution of the daily SLP anomalies represents the SLP gradient, which drives the strength and direction of the circulation, without being influenced by fluctuations in the area-mean absolute SLP values. Areas with elevation higher than 500 m are removed from the fields because pressure reduction to sea level can lead to unrealistic singularities emerging in the SOM training, which then obscure the realistic patterns.
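The anomaly and masking steps can be sketched in a few lines of NumPy. This is a minimal illustration and not the authors' code: the function name `slp_anomalies`, the array layout, the choice to take the domain average over the retained points only, and the sign convention (gridpoint value minus domain mean) are all assumptions. Reversing the sign changes only the sign of the anomalies, not the gradients they describe.

```python
import numpy as np

def slp_anomalies(slp, elevation, max_elev_m=500.0):
    """Return daily SLP anomaly vectors with high terrain removed.

    slp:       array (n_days, ny, nx) of sea-level pressure on the analysis grid
    elevation: array (ny, nx) of surface elevation in metres
    """
    keep = elevation <= max_elev_m          # grid points retained for the SOM
    samples = slp[:, keep]                  # (n_days, n_points), one row per day
    # Domain average over the retained points only (an assumption).
    domain_mean = samples.mean(axis=1, keepdims=True)
    # Assumed sign convention: gridpoint value minus domain mean.
    return samples - domain_mean, keep
```

Each row of the returned array is one input vector for the SOM, and the boolean mask `keep` allows a vector to be placed back on the map for plotting.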

SOM methodology
The SOM consists of a two-dimensional grid of clusters or nodes, which in this case is a grid of SLP anomaly maps. Each node $i$ corresponds to an $n$-dimensional weight or reference vector, $m_i$, where $n$ is the dimension of the input data, treated as a vector created from the gridpoints in each sample.
The initial step of this routine is the creation of a first-guess array, which consists of an arbitrary number of nodes and corresponding reference vectors. In this study we use a grid of 35 nodes, creating a 7x5 array. Slightly smaller and larger SOM matrices were tested to determine a suitable number of nodes for this analysis. If the matrix is too small, some characteristic atmospheric patterns may not be represented; if it is too big, adjacent patterns will be too similar and visualization is unwieldy. The 7x5 matrix appears to capture and separate the important differences in pressure patterns. Moreover, the results are not affected by small differences in the matrix size (see Skific et al. 2009a).
The reference vectors are created at the beginning using linear initialization, which consists of first determining the two eigenvectors with the largest eigenvalues, then letting these eigenvectors span the two-dimensional linear subspace (Kohonen 2001). We use the covariance matrix of the input SLP dataset to determine the two eigenvectors. In this case the centroid of the rectangular array of initial reference vectors corresponds to the mean of the sea level pressure values, and the vectors identified with the corners of the array correspond to the largest eigenvalues. By initiating a SOM in this way, the procedure starts with an already ordered set of weights, so training can begin directly with the convergence phase. Linear initialization helps achieve faster convergence, which is an advantage of this procedure over other methods (e.g., random initialization), but the SOM results are not sensitive to the selected initialization method. In the process of training, each data sample (i.e., one daily map of SLP) is presented to the SOM in the order it occurs in the original data set.
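A minimal sketch of linear initialization follows, assuming the input data are stored as one row per daily field. The function name `linear_init` and the choice to spread the nodes over plus or minus one standard deviation along each leading eigenvector are illustrative assumptions; the source does not state the scaling used. The eigenvectors of the covariance matrix are obtained here from a singular value decomposition of the centered data, which is equivalent and avoids forming the covariance matrix explicitly.

```python
import numpy as np

def linear_init(data, n_rows=5, n_cols=7):
    """Initial reference vectors on the plane of the two leading eigenvectors.

    data: array (n_samples, n_dim). Returns an array (n_rows, n_cols, n_dim).
    """
    mean = data.mean(axis=0)
    centered = data - mean
    # Rows of vt are the eigenvectors of the covariance matrix of `data`.
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    # Standard deviation of the data along the two leading eigenvectors.
    scale = s[:2] / np.sqrt(len(data) - 1)
    # Evenly spaced node coordinates; the +/-1 range is an assumed scaling.
    a = np.linspace(-1.0, 1.0, n_cols)   # long side follows eigenvector 1
    b = np.linspace(-1.0, 1.0, n_rows)   # short side follows eigenvector 2
    return (mean
            + a[None, :, None] * scale[0] * vt[0]
            + b[:, None, None] * scale[1] * vt[1])
```

With odd array dimensions such as 7x5, the central node coincides with the data mean and the corner nodes lie farthest out along both eigenvectors, which matches the ordered starting state described above.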
The similarity between the data sample and each of the reference vectors is then calculated, usually as the Euclidean distance between them. The "best match" node is identified as that with the smallest Euclidean distance between its reference vector and the data sample. Only the vectors for the best-matching node and those that are topologically close to it in the two-dimensional array are updated. The updating scheme is

$$m_i(t+1) = m_i(t) + h_{ci}(t)\,[x(t) - m_i(t)], \qquad h_{ci}(t) = \alpha(t)\,\exp\left(-\frac{\|r_c - r_i\|^2}{2\sigma^2(t)}\right),$$

where $t$ is a discrete-time coordinate, $m_i$ is a reference vector, $x$ is a data sample, and $h_{ci}$ is a neighborhood function (Kohonen 2001), usually in the form of the Gaussian function; $\alpha$ is the training rate function (usually an inverse function of time), $r$ is the location vector in the matrix, $\|r_c - r_i\|$ is the distance between the best-matching node (location $r_c$) and each of the other nodes (location $r_i$) in the two-dimensional matrix, and $\sigma$ defines the width of the kernel, or a relative distance between nodes, often referred to as the radius of training.
The training procedure is controlled by the training rate $\alpha$, the training radius $\sigma$, and the duration of training, which is fixed at 20 times the number of data samples. This choice is based on the "rule of thumb" for optimal training length, which should be longer than 500 times the vector size (see Kohonen 2001). The initial value of the training radius is 4, and it decreases linearly in time. The training scheme is repeated several times, with the training rate reduced by an order of magnitude each time. At the end of each trial the mean quantization error is calculated, defined as

$$E = \frac{1}{M}\sum_{i=1}^{M} \|x_i - m_c\|,$$

where $x_i$ is a data sample, $M$ is the number of samples, and $m_c$ is its best matching unit out of the 35 reference vectors. A smaller mean quantization error indicates a closer resemblance between $m_c$ and the daily SLP anomaly fields.
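The sequential training loop described above can be sketched as follows. This is an illustrative implementation under stated assumptions, not the software used in the study: the function names, the initial rate `alpha0`, the constants in the inverse-time decay, and the lower bound of 0.5 on the radius are all hypothetical choices. The 20-fold training length, the initial radius of 4, the linear shrinking of the radius, and the Gaussian neighborhood follow the text.

```python
import numpy as np

def grid_positions(n_rows=5, n_cols=7):
    """Location vector r_i of every node in the two-dimensional array."""
    rows, cols = np.indices((n_rows, n_cols))
    return np.stack([rows.ravel(), cols.ravel()], axis=1).astype(float)

def train_som(data, weights, positions, alpha0=0.05, sigma0=4.0, n_steps=None):
    """Sequential SOM training; returns an updated copy of `weights`.

    data:      (n_samples, n_dim) input vectors, presented in their stored order
    weights:   (n_nodes, n_dim) initial reference vectors
    positions: (n_nodes, 2) node locations from grid_positions()
    """
    w = np.array(weights, dtype=float)              # copy, caller's array untouched
    n_steps = 20 * len(data) if n_steps is None else n_steps
    for t in range(n_steps):
        x = data[t % len(data)]
        frac = t / n_steps
        alpha = alpha0 / (1.0 + 100.0 * frac)       # inverse-time decay (assumed constants)
        sigma = max(sigma0 * (1.0 - frac), 0.5)     # linear shrink, assumed floor of 0.5
        c = np.argmin(np.linalg.norm(w - x, axis=1))            # best-matching node
        d2 = np.sum((positions - positions[c]) ** 2, axis=1)    # ||r_c - r_i||^2
        h = alpha * np.exp(-d2 / (2.0 * sigma ** 2))            # Gaussian neighborhood
        w += h[:, None] * (x - w)                   # move each node toward the sample
    return w
```

Because the neighborhood weight never exceeds the training rate, every update moves a reference vector only part of the way toward the presented sample, so neighboring nodes are pulled along with the best match and end up resembling one another.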
Once the smallest mean quantization error is found, we then fine-tune the training by varying the training rate slightly around that value and calculating the error for each trial. The training is complete once the smallest mean quantization error is identified, as the reference vectors from that training best approximate the data space of interest. The final reference vectors are then mapped onto a 2D grid, with their locations in the matrix corresponding to their matching nodes. The maps in the resulting matrix represent the predominant patterns in which the atmosphere tends to reside, or alternatively the centroid of the particular data cluster.
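Selecting the training run with the smallest mean quantization error might look like the sketch below. It assumes the error is the mean Euclidean distance between each sample and its best-matching reference vector, and that the trained reference vectors from each trial are held in a dictionary keyed by training rate; both the function names and that data structure are hypothetical.

```python
import numpy as np

def mean_quantization_error(data, weights):
    """Mean Euclidean distance from each sample to its best-matching node.

    data: (n_samples, n_dim); weights: (n_nodes, n_dim).
    """
    # Squared distances via |x|^2 - 2 x.w + |w|^2, which avoids building
    # an (n_samples, n_nodes, n_dim) array for large fields.
    d2 = ((data ** 2).sum(axis=1)[:, None]
          - 2.0 * data @ weights.T
          + (weights ** 2).sum(axis=1)[None, :])
    d2 = np.maximum(d2, 0.0)                 # guard against round-off below zero
    return np.sqrt(d2.min(axis=1)).mean()

def select_best(data, candidates):
    """candidates: dict mapping a training rate to its trained weights."""
    errors = {rate: mean_quantization_error(data, w)
              for rate, w in candidates.items()}
    best_rate = min(errors, key=errors.get)
    return best_rate, errors
```

The same routine serves both the coarse search over orders of magnitude and the fine-tuning around the best rate, since each only requires comparing errors across trials.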
Although the measure of similarity between the data and the reference vector is linear, it is this iterative training procedure that allows the SOM to account for the non-linear data distributions (Hewitson and Crane 2002). The non-linear approximation of the data space is therefore a great advantage of the method compared to some other approaches, such as empirical orthogonal functions (EOFs) (Reusch et al. 2005).

Detection of regional climate change
Once the SOM has been trained and the final set of reference vectors has been identified, daily fields of SLP anomalies can be mapped to the best-matching pattern to form clusters of daily maps that are most similar to each pattern. This is achieved by finding the pattern in the SOM that minimizes the Euclidean distance between itself and the daily field. Once all the SLP anomaly fields have been assigned to a node, the frequencies of occurrence (FO) can be determined, i.e., the fraction of daily fields that reside in each cluster. Ascribing a particular daily SLP sample to a specific circulation pattern in the SOM also enables an analysis of associated variables (such as temperature, precipitation, cloud amount, etc.) for the same days as those in each cluster. By mapping the new variable onto a particular SLP-derived cluster, the matrix of maps for any other variable can be used to describe the conditions associated with a specific circulation regime. The following examples elaborate this procedure in more detail.
Figure 1a shows the dominant circulation regimes in which the Arctic atmosphere resides, according to the CCSM3 model output described above. These clusters or neuronal weights form a discrete approximation of the data distribution and, in the process of SOM creation, become organized on a 2D grid. Clusters near each other on the grid are more similar than clusters farther apart. The most distinct patterns are situated in the corners of the map, while the cluster positions on the master SOM and their mutual distances approximate the probability density function of the dataset. This technique results in an overlap among neighboring clusters because, during training, a single data sample contributes to defining the neighboring clusters as well as the most similar one.
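The node assignment, frequencies of occurrence, and cluster averages of an associated variable described above can be sketched as follows. The function names and array layouts are assumptions made for illustration; empty clusters are returned as NaN, which is one possible convention and not something the source specifies.

```python
import numpy as np

def assign_nodes(data, weights):
    """Index of the best-matching node for every daily field.

    data: (n_days, n_dim); weights: (n_nodes, n_dim).
    """
    d2 = ((data ** 2).sum(axis=1)[:, None]
          - 2.0 * data @ weights.T
          + (weights ** 2).sum(axis=1)[None, :])
    return d2.argmin(axis=1)

def frequency_of_occurrence(bmu, n_nodes=35):
    """Fraction of daily fields that reside in each cluster."""
    return np.bincount(bmu, minlength=n_nodes) / len(bmu)

def cluster_means(bmu, fields, n_nodes=35):
    """Average an associated variable over the days assigned to each node.

    fields: (n_days, ...) values of, e.g., cloud fraction on the same days.
    """
    out = np.full((n_nodes,) + fields.shape[1:], np.nan)
    for i in range(n_nodes):
        members = bmu == i
        if members.any():                    # leave empty clusters as NaN
            out[i] = fields[members].mean(axis=0)
    return out
```

Reshaping the output of `cluster_means` to the 7x5 layout gives the matrix of maps for the associated variable, one map per circulation pattern.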
In this example, the clusters in the lower right side of the master SOM are characterized by a pronounced low pressure center in the North Atlantic and Pacific region, and high pressure over the Eurasian continent. These features bear a close resemblance to the North Atlantic Oscillation (NAO), the most dominant SLP mode of variability in the high-latitude winter atmosphere. The upper right side of the master SOM has cluster groups characterized by a pronounced low pressure in the North Atlantic that extends farther northward and eastward, towards the Norwegian and Barents Seas. A strong high-pressure ridge generally resides over the western Arctic in these clusters. The lower left corner of the map contains clusters with low pressure over the Arctic, while in the clusters of the upper left corner low (high) pressure is generally present in the western (eastern) Arctic. The clusters in the middle of the map show a weak or moderate ridge over the central Arctic region.
Further insight into the relationships between adjacent nodes is provided by the so-called Sammon map (Figure 1b). This distortion surface is a projection of the Euclidean distances between neighboring nodes of the SOM matrix (Figure 1a) onto a set of 2D vectors, following the Sammon mapping algorithm (Sammon 1969). Numbers 1 to 35 on the distortion surface correspond to the nodes on the SOM matrix, from the upper left to the bottom right corner of the map. Generally, the closer two nodes are on the Sammon map, the more similar they are to each other. Although the distortion surface generally conforms to the expected rectangular shape of the SOM matrix (Figure 1a), some more detailed relationships between the nodes are revealed. One can see that nodes on the left of the SOM are closer in Euclidean space than those on the right. This shows that the node distribution approximates the multi-dimensional distribution function, as the nodes are more closely spaced in regions where data density is higher, i.e., where nodes have more members in their group. The Sammon map is also useful for visualizing the process of creating the SOM. If one were to draw a Sammon projection at each step of SOM training, each iteration would look like a wrinkled tablecloth that gradually becomes less wrinkled until, at the end of the training, it resembles its familiar rectangular shape (Figure 1c).
To better understand the origin and characteristics of circulation patterns in the master SOM, we create monthly histograms of CCSM3 mean SLP for each node in the SOM. Figure 2 shows a matrix of histograms corresponding to the master SOM (Figure 1). The x-axis of each histogram is the month, while the y-axis is the number of times that this particular cluster was the best match for an individual daily SLP map during that month. The histograms illustrate the frequency with which a particular regime occurs during each month. As mentioned earlier, the clusters on the right side are recognized as the NAO pattern, which is most common in winter. The histograms for these clusters confirm that the non-summer days are most likely to have these patterns. A single cluster in the upper left corner of the master SOM also describes non-summer circulation patterns. These cold-season histograms show that the clusters with pronounced low pressure in the North Atlantic along with a ridge over the Eurasian continent or over the western Arctic occur most frequently in the winter months. From spring through summer these patterns, although still present, become less frequent, then increase again in fall. These patterns thus depict the annual cycle of development and intensification of the Icelandic low, influenced by strong temperature gradients along the sea ice margins (Serreze 1995; Serreze et al. 1997a), "splitting" of lows moving in from the south and southwest by the high orographic barrier of Greenland (Serreze and Barry 2005), and lee-side vorticity production off the southeast coast of Greenland (Petersen et al. 2003). The vigorous synoptic activity in the Atlantic is also related to the large moist static energy transport into the Arctic through this sector (Overland et al. 1996). Features towards the middle left side of the master SOM are primarily summer patterns, with increased FO as the warmest season approaches. They are characterized by either weak low pressure, most commonly in the western Arctic, or a weak atmospheric ridge over the central Arctic.
Figure 2. Frequency of occurrence displayed as monthly histograms for each node in the master SOM. The x-axis is month; the y-axis is the number of times that node was the best matching unit during that month.
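The counts behind such histograms can be accumulated directly from the best-matching node of each day and its calendar month. The sketch below assumes two integer arrays of equal length, `bmu` (node index per day, starting at 0) and `months` (calendar month per day, 1 to 12); both names are hypothetical.

```python
import numpy as np

def monthly_histograms(bmu, months, n_nodes=35):
    """counts[i, m - 1] = number of days in month m whose best match is node i."""
    bmu = np.asarray(bmu)
    months = np.asarray(months)
    counts = np.zeros((n_nodes, 12), dtype=int)
    # np.add.at accumulates correctly when the same (node, month) pair repeats.
    np.add.at(counts, (bmu, months - 1), 1)
    return counts
```

Each row of the result is one histogram of the matrix, and reshaping the rows to the 7x5 layout places every histogram in the position of its circulation pattern.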
The SOM algorithm keeps track of which days in the data record fall into which cluster, so one can analyze other variables on the same days that exhibit a given circulation regime. This is a highly valuable tool for attributing causes of variations in particular variables. We illustrate this tool by investigating fields of cloud fraction. Figure 3 shows cluster-averaged cloud fraction for the CCSM3 1971-1999 period. The fact that the minimum values are near 50% indicates that the Arctic is a very cloudy place. Low stratus clouds account for over 80% of the total cloud cover in this region. Clusters on the right side of the map correspond to the cold-season regimes, and as expected, these are characterized by lower values of cloud fraction, typically around 50% for the central Arctic region. Summer patterns, located in the middle left of the SOM, show the most extensive cloud cover for the central Arctic, around 75%.
The projection of cloud features on a geographic map assists in understanding the reasons for observed cloud patterns. Variability over the central Arctic, for example, is primarily driven by the seasonality in the low-level stratus, which is related to the availability of moisture. In summer when the surface is melting, evaporation increases and clouds are more abundant than in the much colder winter months when evaporation is small (Beesley and Moritz 1999). The Atlantic sector, in contrast, is frequently overcast, reflecting abundant moisture and intense cyclonic activity, particularly during the winter season. Large cloud fractions over the Scandinavian Peninsula in winter patterns (right side of the map) are due to orographic uplift at the continental boundary. Heavy overcast is also seen along the land-ocean boundary in the summer patterns (in the middle left of the SOM). These areas of increased cloudiness likely reflect summer cyclonic activity in connection with the development of the summer Arctic frontal zone. This discontinuity is sustained by differential heating between the Arctic Ocean (at temperatures kept at the melting point) and snow-free land, as well as sharpening of the baroclinicity by coastal orography (high topography can "trap" the cold ocean air). Increased summer cloud fraction over Eurasia and northwestern Canada likely also reflects increased cyclogenesis over land areas and more abundant convective clouds, in contrast to the winter months that are dominated by low stratus.
Another related variable that is highly dependent on cloud characteristics is the downward longwave (infrared) radiation flux (DLF). This is an important quantity for studies of the Arctic energy balance, as no energy is received from the sun during the 6-month polar night. Figure 4 shows cluster-averaged DLF corresponding to each circulation pattern on the master SOM (Figure 1). The spatial pattern of the mean DLF is similar to the distribution of total cloud fraction (Figure 3), underscoring the close relationship between DLF and cloud properties. During the cold season (patterns on the right of the matrix) the highest values of DLF occur in the Atlantic sector, where cloud cover is extensive both horizontally and vertically, water vapor is abundant, and temperatures are relatively warm. The surface elevation of central Greenland is above much of the cloud cover, which along with low temperatures and humidity values, results in low DLF values in all clusters and seasons. Higher values of DLF for the clusters on the left side of the map, corresponding to the summer patterns, are related to the seasonal increase of surface temperature and atmospheric water vapor.
In addition to investigations of relationships between circulation patterns and corresponding atmospheric variables, the SOM analysis also allows an assessment of changing tendencies for the atmosphere to reside in particular circulation patterns.
These kinds of questions are relevant to climate-change studies (for example, whether summer-like patterns are becoming more frequent as the climate warms, or whether the relative locations of high and low pressure centers are shifting). The top two panels of Figure 5 display the percentage of winter and summer days that fall into each node during the late 20th century. Consistent with the histograms shown in Figure 2, winter days clearly tend to exhibit the circulation patterns along the right side of the master SOM, while summer patterns occur in nodes in the center left. By calculating frequencies of occurrence (FO) in two time periods, one can determine whether the atmospheric circulation patterns are shifting. We present an example in the lower panel of Figure 5 that assesses changes in the FOs of each cluster on the master SOM from the late 20th century to the late 21st century (2070-2089). Black solid (dashed) contours show areas of significantly (> 95% confidence) higher (lower) difference in frequency of occurrence. The range of the 95% confidence interval is

$$\pm 1.96 \sqrt{\frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}},$$

where $p_1(1-p_1)/n_1$ and $p_2(1-p_2)/n_2$ are variances of two independent, random, binomial processes, $p_1$ and $p_2$ are the expected frequencies of occurrence for the two time periods ($p = 1/35$), $n_1$ is the number of samples in the first data set, and $n_2$ is the number of samples in the second data set (for more details see Cassano et al. 2007). Because this statistical test does not account for the effects of serial correlation in the daily SLP fields, and thus likely overestimates the degrees of freedom, we approximate the effective degrees of freedom by dividing the number of samples in each of the two data sets by 7. This value is determined from the serial correlation of the SLP time series, which indicates that the atmosphere tends to reside in a circulation regime for about one week. This procedure decreases the degrees of freedom, thus establishing a higher threshold for determining significance.
A pronounced, statistically significant increase is apparent in patterns with low pressure over the central Arctic (left of the master SOM, Figure 1), as well as those with strong high pressure across the western Arctic region and strong low pressure in the Atlantic sector and eastern Arctic (upper right of the SOM). The clusters in the middle, mostly dominated by weak or moderate high pressure over the central Arctic, decrease significantly. Taken together these changes represent a decrease in pressure over the central Arctic in this greenhouse-gas-forced model projection, but SOMs and their corresponding FOs allow for a more detailed, quantified look at regional variability in pressure change.
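The significance test with reduced degrees of freedom can be sketched as below. It assumes the interval half-width is 1.96 times the square root of the summed binomial variances, with the expected frequency p = 1/35 in both periods and each sample count divided by 7, as described in the text. The function name and return layout are hypothetical.

```python
import numpy as np

def fo_change_significance(fo1, fo2, n1, n2, n_nodes=35,
                           persistence_days=7, z=1.96):
    """Flag nodes whose change in frequency of occurrence is significant.

    fo1, fo2: per-node frequencies of occurrence in the two periods
    n1, n2:   number of daily samples in each period
    """
    p = 1.0 / n_nodes                        # expected FO if all nodes were equally likely
    n1_eff = n1 / persistence_days           # effective sample sizes after
    n2_eff = n2 / persistence_days           # allowing for week-long persistence
    half_width = z * np.sqrt(p * (1 - p) / n1_eff + p * (1 - p) / n2_eff)
    diff = np.asarray(fo2, dtype=float) - np.asarray(fo1, dtype=float)
    return diff, half_width, np.abs(diff) > half_width
```

Dividing the sample counts by 7 widens the interval by a factor of about 2.6 (the square root of 7), so only larger shifts in frequency are flagged.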

Attribution of regional climate change
This section demonstrates how the SOM technique can be adapted to better understand what processes are driving the change in a variable of interest. Cassano et al. (2007) formulated a method that separates the factors contributing to a temporal change in a variable of interest into the portion caused by a change in the FO of daily maps in a cluster, the portion due to a change in the cluster-mean value of the physical variable, and a third due to a combination of the two effects. The equation is given as follows:

$$\Delta x = \sum_{i=1}^{N} \left[ (f_i + \Delta f_i)(x_i + \Delta x_i) - f_i x_i \right], \tag{1}$$

where $\Delta x$ is the total change in a variable between two different time periods, $x_i$ is the cluster-averaged variable in the initial time period, $f_i$ is the FO of the daily maps in cluster $i$ during the initial period, $\Delta f_i$ is the change in FO for cluster $i$ between the two periods of interest, $\Delta x_i$ is the change in the cluster-averaged variable between the two periods of interest, and $N$ is the total number of clusters ($N = 35$ in this study). Expanding (1):

$$\Delta x = \sum_{i=1}^{N} \left( x_i \Delta f_i + f_i \Delta x_i + \Delta f_i \Delta x_i \right). \tag{2}$$

The first term, $x_i \Delta f_i$, relates changes in the pressure field to changes in the FO of circulation patterns. It shows the portion of the total change owing to shifts in the frequencies with which daily SLP fields reside in the patterns depicted in the SOM. A change in this distribution represents a change in the surface circulation, and thus we loosely refer to this contribution as the "dynamic factor." The second term, $f_i \Delta x_i$, relates to temporal changes in the variable of interest averaged over all days that belong to a cluster. In the case of cloud fraction, changes of this type are likely caused mainly by thermodynamic effects (such as changes in the horizontal and vertical distribution of water vapor, varying moisture and temperature gradients, or changes in evaporation), thus we refer to this contribution as the "thermodynamic factor" (Skific et al. 2009a, 2009b). The third term in Eq. 2 represents the contribution from the interaction of both changing pattern frequency and the cluster-averaged variable. This term tends to be small compared to the other two.
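The three-term decomposition described above can be checked numerically with a short sketch. The function name and the dictionary layout are hypothetical, and the inputs are assumed to be per-cluster frequencies of occurrence and cluster-averaged values for the two periods, with no empty clusters.

```python
import numpy as np

def decompose_change(f1, x1, f2, x2):
    """Split the total change in a variable into its three contributions.

    f1, f2: per-cluster FO in the initial and final periods, shape (N,)
    x1, x2: cluster-averaged variable in the two periods, shape (N,)
    """
    f1, x1, f2, x2 = (np.asarray(a, dtype=float) for a in (f1, x1, f2, x2))
    df = f2 - f1                             # change in FO per cluster
    dx = x2 - x1                             # change in the cluster mean
    return {
        "dynamic": np.sum(x1 * df),          # shifts in pattern frequency
        "thermodynamic": np.sum(f1 * dx),    # change within a fixed pattern
        "combined": np.sum(df * dx),         # interaction of the two
        "total": np.sum(f2 * x2) - np.sum(f1 * x1),
    }
```

For example, with two clusters, `f1 = [0.5, 0.5]`, `x1 = [60, 80]`, `f2 = [0.4, 0.6]` and `x2 = [62, 85]`, the dynamic, thermodynamic, and combined terms are 2.0, 3.5, and 0.3, which sum to the total change of 5.8.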
Figure 6. (a) Annual and seasonal changes in total cloud fraction (%) in the region north of 70°N in the CCSM3 from the late 20th century to the late 21st century (2070-2089). Contributions to the total change from the dynamic, thermodynamic, and combined terms are also shown. (b) Annual and seasonal changes in downward longwave flux (DLF, in W m$^{-2}$) over the region north of 70°N in the CCSM3 from the 20th century to the late 21st century (2070-2089). Contributions to the total change from the dynamic, thermodynamic, and combined terms are also shown.
We therefore use the derived master SOM and the aforementioned procedure to determine the portions of the total change in cloud fraction (Figure 6a) and DLF (Figure 6b) from the late 20th century to the late 21st century that are due to dynamics, thermodynamics, and their combination. The experiment focuses on both the annual and seasonal changes. Total cloud fraction north of 70°N is projected to increase by 6.5% by the end of the 21st century, with largest changes expected to occur in the fall and winter (increase in cloud fraction of about 2%). The figures show that for both variables, the increase is caused mainly by the thermodynamic factor, while the contribution to cloud and DLF changes from shifting FO of particular pressure patterns is relatively small. Thermodynamic processes driving the cloud changes are likely related to local temperature changes and their effects on cloud formation, such as changes in precipitable water that would lead to changes in cloud amount even under the same dynamics, changes in the temperature and humidity profiles, and changes in relative humidity owing to changes in evaporation. The largest increase in cloud fraction in this model run occurs in the fall season and the smallest in the summer (Figure 6a). These changes are driven by the projected increases in surface temperature, which are strongest in fall and winter. This also suggests that the seasonality in cloud cover over the Arctic is primarily driven by low-level stratus (Serreze and Barry 2005), as they constitute about 80% of the total cloud cover over the central Arctic Ocean. These clouds are greatly affected by changes in surface temperature through its effect on the stratification of the atmospheric boundary layer. Fall and winter seasons have experienced the largest changes in surface temperature in connection with the loss of sea ice and resulting increased absorption of solar energy into the Arctic Ocean.
DLF increases most in the winter (by 12.6 W m$^{-2}$, Figure 6b), which points to the sensitivity of DLF to surface temperature, water vapor, and cloud cover. Summer increases in both cloud amount and DLF are small because surface temperature is fixed at the melting point while sea ice melts, and a heavy overcast of low clouds persists through the melt season. These relatively constant conditions dictate small changes in DLF. Because the model projects only a small increase in summer surface temperature by the end of the 21st century, it is expected that cloud fraction and DLF will not increase substantially either.
Changes in both cloud fraction and DLF are driven mainly by processes that occur for a fixed circulation regime, which implies that the same synoptic pattern with similar pressure gradients will generate more clouds and thus a larger DLF by the end of the 21st century. Because clouds are an important element of the Arctic energy budget, these changes will feed back on surface temperatures, leading to even higher surface temperatures.

Conclusion
Here we present examples of our application of SOMs to understanding changes in the Arctic climate system as greenhouse gas concentrations continue to increase. The purpose of this exercise is to illustrate the power and adaptability of the SOM technique for a variety of scientific applications. Understanding and possibly predicting potential drivers of Arctic climate change has great implications for society as a whole, through effects on sea-level rise, mid-latitude weather patterns, and marine and terrestrial productivity, to name only a few. Linkages between the dramatic changes within the Arctic and the global system remain poorly understood in present conditions, thus the uncertainty regarding future changes will remain an important focus for global-change research.
Changes in Arctic cloud fraction and downward longwave flux, and mechanisms responsible for those changes, are presented through the rather simple and conceptually appealing neural network technique of SOMs, introduced in Cassano et al. (2007), and further elaborated in Skific et al. (2009a, 2009b). Given the recent and rapid changes of the Arctic climate system and uncertainties associated with many climate feedbacks over this region, this technique offers a valuable contribution toward attributing these changes to either shifts in the dominant atmospheric circulation patterns or to thermodynamically driven changes in the set of variables associated with them.