Coarse-Grain Cluster Analysis of Tensors With Application to Climate Biome Identification

A tensor provides a concise way to codify the interdependence of complex data. When a tensor is treated as a d-way array, each entry records the interaction between the different indices. Clustering provides a way to parse the complexity of the data into more readily understandable information. Clustering methods are heavily dependent on the algorithm of choice, as well as the chosen hyperparameters of the algorithm. However, their sensitivity to data scales is largely unknown. In this work, we apply the discrete wavelet transform to analyze the effects of coarse-graining on clustering tensor data. We are particularly interested in understanding how scale affects clustering of the Earth's climate system. The discrete wavelet transform allows classification of the Earth's climate across a multitude of spatial-temporal scales, and it is used to produce an ensemble of classification estimates, as opposed to a single classification. Using information theory, we discover a sub-collection of the ensemble that spans the majority of the variance observed, allowing for efficient consensus clustering techniques that can be used to identify climate biomes.


Introduction
Data measured from a high-order complex system can be difficult to analyze. A convenient way to store such data is in the form of a tensor, or d-way array. Each entry of the array describes the value obtained across the d parameters. Often, the dependencies between indices are not clear, making interpretation of the data a demanding task. Clustering slices or sub-tensors allows one to readily parse the complex interdependencies to provide meaningful interpretation.
The focus of this paper is a new method for clustering slices of a tensor. This procedure is designed to address some of the fundamental flaws of clustering discussed below. While the method is general, it is designed for a particular domain of application, namely, an improved capability to detect and identify climate biomes. arXiv:2001.07827v1 [cs.LG] 22 Jan 2020

Clustering challenges
Clustering is known to be an ill-defined problem in the sense that no clustering algorithm satisfies all desirable clustering criteria [6]. Further, the number of clusterings for n data points is astronomically large, leading to a difficult search problem. As a result, some prior bias for exploring the space of clusterings must be adopted. However, the resulting optimization schemes are almost always NP-hard [9].
Because each clustering requires many choices, different clustering measures have been formulated to assess the quality of a clustering. As with the clustering schemes themselves, these measures are largely arbitrary. Indeed, the measures are often directly exported from the optimization functions used in the clustering algorithm. The algorithm that is designed to optimize a given clustering measure will, by design, outperform other clustering methods with respect to that metric. As a result, the measure provides no further information as to which clustering strategy is better suited for the problem.
These challenges highlight that there is no true "best" clustering in general. Rather, there are many good clusterings that arise from the specifics of the scientific inquiry pursued. However, the quality of the clustering is not easy to evaluate. Furthermore, no particular clustering in this collection is certifiably "correct", but each provides different insights into the structure of the data.
While a collection of clusterings is more robust to error than a single clustering, it is often less interpretable. To some extent, this defeats the purpose of clustering. This lack of interpretability has led researchers to define the concept of an ensemble, or consensus, clustering [12]. Here, many clusterings are combined to produce a single clustering of the data. Common features between the clusterings are amplified, and artifacts become dulled (Figure 1a). Generally, selecting smaller ensembles with diverse clusters has been shown to outperform larger ensembles [4]. Therefore, it is advantageous for users to adopt a method of parsimony for constructing their ensemble.

Classifying climate biomes
There are still unresolved issues not addressed by the current ensemble clustering framework. One practical example is the scale at which the data is acquired. Most natural or environmental data is formed by directly observing and measuring quantities where the underlying or driving processes are usually unknown. The hidden or latent features of the data may not clearly present themselves at the resolution that the data was sampled. This problem arises in the climate sciences. Here, weather data is often gathered at fine temporal and spatial detail, e.g., daily temperature at a single weather station. However, climate signals are often observed on the order of years or decades and across a region.
Thus, climate data frequently arises as a tensor. At a single site located at a specific spatial coordinate, one has a time series of various climate measurements. The tensor of climate data compactly records the complex interdependence between space and time for different variables of interest. Clustering the data according to the spatial index is parallel to identifying climate biomes.
Historically, the standard used to classify climate biomes has been the Köppen-Geiger (KG) model [7]. The KG model is an expert-based judgment that describes climate zones using temperature and precipitation measurements. The KG model utilizes a fixed decision tree, where each branch uses various information about temperature and precipitation. This heuristic allows one to broadly assess climate regions. While KG is interpretable, it is overly simplistic and somewhat arbitrary. In an attempt to remedy this problem, Thornthwaite [13] introduced a more nuanced model using moisture and thermal factors. However, the Thornthwaite model (along with its successors) still suffers from expertly chosen biases in its parameters.
A solution to this problem is to move towards data-driven methods of classification. Here, the human bias is placed onto the machine learning algorithm that seeks to minimize some cost function. This is equivalent to a statistical assumption about the data generation and distribution [1]. In the view of the authors, this is often a more reasonable assignment of bias. In [17], Zscheischler et al. compare KG to the K-means algorithm. They show that, unsurprisingly, K-means outperforms KG with respect to some statistical measures. In [10], the authors use mean monthly climate data to perform hierarchical clustering and partitioning around medoids. For each clustering algorithm, two distance metrics are tested, and the results are compared to KG using an information-theoretic measure.
These data-driven approaches to climate clustering are an epistemological improvement over the user-chosen heuristics of KG. However, data-driven methods still suffer from two key problems. First, the algorithms themselves are user-chosen, and therefore somewhat arbitrary. Because clustering is an ill-posed problem, no single clustering is necessarily a clear improvement over another. Rather, a collection of clusterings with reasonable coherency should be assembled for further analysis. Second, because the algorithms are dependent on the input data, they are dependent on the scale at which the data is acquired. A priori, it is not clear how this "hidden parameter" of scale affects the overall clustering result. Latent features of the climate system may appear at different coarse-grainings, and it is important to analyze how the scale affects the clustering of the data.

Key contributions
The above discussion highlights two important problems: 1) the unknown dependence of clustering on the scale of the data and 2) the necessity of building an ensemble of clusterings. In this work, we discuss a clustering method that illuminates these dependencies and builds an ensemble of clusterings that efficiently represents the diversity across different coarse-grainings. We develop a technique, which we call coarse-grain clustering (CGC), that uses the discrete wavelet transform to cluster slices of tensors at different scales. This results in many potential clusterings, one for each chosen coarse-graining. Not all of these coarse-grain clusterings provide new information, however. Thus, we present a novel selection method that leverages mutual information between clusterings to quantify the loss of information between clusterings and select a small subset that best represents the ensemble. We call this reduction algorithm Mutual Information Ensemble Reduce (MIER). While the end-to-end workflow we have discussed involves ideas from traditional consensus clustering (e.g., Figure 1a), the focus of this paper is specifically on a novel modification to this approach that uses the CGC and MIER algorithms to develop a classification, i.e., the blue highlighted portion of Figure 1b.
This paper is organized as follows. First, background material used for development of the CGC and MIER algorithms is presented in Section 2. The structure of these algorithms is detailed in Section 3. In Section 4, the algorithms are applied to a widely used climate data set as a case study, with presentation of results and discussion, followed by conclusions and a recommendation for future work in Section 5.

Preliminaries
In this section, we briefly review key mathematical tools used throughout this work, including 1) the discrete wavelet transform and its role in separating Earth system data into spatio-temporal scales, 2) graph cuts and their connection to spectral clustering, and 3) the use of mutual information to measure similarity between two clusterings of the same data.

Discrete Wavelet Transform (DWT)
Given a one-dimensional discrete function f : N → R the discrete wavelet transform (DWT) is a process of iteratively decomposing f into a series of low and high frequency signals. The low frequency signal is often referred to as the approximation coefficients, and the high frequency is called the detail coefficients. This process is accomplished by convolving the function f with low frequency and high frequency filter functions that arise from a choice of mother wavelet, sometimes called a wavelet for short.
We are interested in multi-way signals, namely tensorial data. The wavelet transform of a tensor X is obtained by taking one-dimensional wavelet transforms along each axis of interest, where a different wavelet may be chosen for each axis. Note that the DWT can be applied multiple times to a tensor axis. At each step, the signal is decomposed into its high and low frequency signals. These are then downsampled. Taking the low frequency signal, one again performs the DWT, splitting this into another high and low frequency signal to be further downsampled. This process, known as a filter bank, is illustrated in Figure 2. For a comprehensive overview of wavelets, see [5]. In the climate setting, low-frequency information separated by the wavelet captures climatology and large-scale spatial features, while high-frequency information quantifies weather. For example, coarse-graining temporal signals captures seasonal, yearly, and eventually decadal trends, whereas coarse-graining spatial information captures city, county, and eventually state size features. Therefore, to classify regional climate systems into biomes, we use the wavelet approximation coefficients.
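As a rough illustration of the filter bank, the sketch below uses the Haar wavelet, for which the low and high frequency filters reduce to scaled pairwise sums and differences. The function names and the toy series are illustrative assumptions and are not taken from the paper; a real analysis would more likely rely on a wavelet library than on this hand-written version.

```python
import math

def haar_step(signal):
    """One Haar DWT level: returns (approximation, detail) coefficients."""
    if len(signal) % 2:
        raise ValueError("this sketch assumes an even-length signal")
    s = math.sqrt(2.0)
    # Low-pass filter followed by downsampling by two.
    approx = [(signal[i] + signal[i + 1]) / s for i in range(0, len(signal), 2)]
    # High-pass filter followed by downsampling by two.
    detail = [(signal[i] - signal[i + 1]) / s for i in range(0, len(signal), 2)]
    return approx, detail

def haar_approximation(signal, level):
    """Apply the filter bank `level` times, keeping only the low-frequency branch."""
    approx = list(signal)
    for _ in range(level):
        approx, _detail = haar_step(approx)
    return approx

# Toy series: a slow upward trend plus a fast alternating fluctuation.
series = [1.0, 3.0, 2.0, 4.0, 3.0, 5.0, 4.0, 6.0]
coarse = haar_approximation(series, level=2)  # two coefficients, one per block of four
```

Each level halves the length of the signal, so two levels summarize every block of four samples by a single approximation coefficient in which the fast fluctuation has been averaged out.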

K-means and spectral clustering
Clustering algorithms are diverse, with varying advantages and disadvantages [3]. Arguably the most famous are partition-based algorithms, where data is iteratively reassigned to clusters until an optimization function is minimized. The prototypical example of a partition-based clustering algorithm is K-means. Given a natural number k, the K-means algorithm seeks to partition the dataset into k distinct groups that minimize the variance within the clusters.
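The iterative reassignment described above can be sketched with Lloyd's iteration, the usual heuristic for K-means. This is a minimal pure-Python sketch under stated assumptions: the initial centroids are supplied by the caller, a fixed number of iterations is run, and the function names are hypothetical.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, initial_centroids, iterations=20):
    """Lloyd's iteration: alternate nearest-centroid assignment and centroid update."""
    centroids = [list(c) for c in initial_centroids]
    labels = []
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        labels = [min(range(len(centroids)),
                      key=lambda j: squared_distance(p, centroids[j]))
                  for p in points]
        # Update step: each centroid moves to the mean of its members.
        for j in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = [sum(coord) / len(members) for coord in zip(*members)]
    return labels, centroids

def within_cluster_variance(points, labels, centroids):
    """The objective K-means tries to minimize (sum of squared distances)."""
    return sum(squared_distance(p, centroids[label]) for p, label in zip(points, labels))

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (9.0, 9.0), (9.0, 10.0), (10.0, 9.0)]
labels, centroids = kmeans(points, initial_centroids=[points[0], points[3]])
objective = within_cluster_variance(points, labels, centroids)
```

The result depends on the initial centroids, which is one instance of the hyperparameter sensitivity discussed in the introduction.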
Another popular method of clustering is spectral clustering, whereby one leverages spectral graph theory to separate the data into clusters. In spectral clustering, an undirected weighted graph G is formed, where each vertex is a data point and the edge weight is a chosen affinity between vertices.
Let $W = (w_{i,j})_{i,j=1}^{n}$ denote the weighted adjacency matrix for the graph G. The (unnormalized) graph Laplacian $L = D - W$, where $D$ is the diagonal matrix of vertex degrees $d_i = \sum_{j} w_{i,j}$, is a matrix that captures the combinatorial properties of the Laplacian on discrete data. The Laplacian L is a symmetric positive semi-definite matrix, so the eigenvalues may be ordered $0 = \lambda_1 \le \lambda_2 \le \dots \le \lambda_n$. Finding the eigenvectors $e_j$ corresponding to the lowest k eigenvalues, define $U = [e_1 | e_2 | \dots | e_k]$ and cluster the rows using K-means. For more details, see [11].
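To make the construction concrete, the sketch below builds the unnormalized Laplacian as the degree matrix minus the adjacency matrix, which is the standard definition, for a small graph with two connected components. The helper names and the toy weights are assumptions made for illustration only.

```python
def graph_laplacian(W):
    """Unnormalized Laplacian L = D - W for a symmetric weight matrix W."""
    n = len(W)
    degrees = [sum(row) for row in W]
    return [[(degrees[i] if i == j else 0.0) - W[i][j] for j in range(n)]
            for i in range(n)]

def quadratic_form(L, x):
    """x^T L x, which equals one half of the sum of w_ij * (x_i - x_j)^2."""
    n = len(L)
    return sum(x[i] * L[i][j] * x[j] for i in range(n) for j in range(n))

# Two disconnected pairs of vertices: {0, 1} and {2, 3}.
W = [[0.0, 1.0, 0.0, 0.0],
     [1.0, 0.0, 0.0, 0.0],
     [0.0, 0.0, 0.0, 2.0],
     [0.0, 0.0, 2.0, 0.0]]
L = graph_laplacian(W)

# The indicator vector of a connected component lies in the null space of L.
component_energy = quadratic_form(L, [1.0, 1.0, 0.0, 0.0])
# A vector that splits a component pays for the edge it cuts.
split_energy = quadratic_form(L, [1.0, 0.0, 0.0, 0.0])
```

Because each connected component contributes one indicator vector with zero energy, the number of zero eigenvalues of the Laplacian matches the number of connected components, which is what makes the small eigenvalues informative for clustering.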
However, both K-means and spectral clustering require the user to choose k, and additional heuristics are needed to constrain the search space. In spectral clustering, one can use the eigenvalues of the Laplacian L to determine the cluster number. Specifically, as the eigenvalues are ordered, one searches for a value of k such that $\lambda_1, \dots, \lambda_k$ are small and $\lambda_{k+1}$ is large. This method is justified by the fact that the spectral properties of L are closely related to the connected components of G [15]. Use of the graph Laplacian eigenvalues to decide the cluster number k is called the eigen-gap heuristic.
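One common way to operationalize the eigen-gap heuristic is to pick the k that maximizes the difference between consecutive sorted eigenvalues. The sketch below follows that reading; it is an assumption about how "small" and "large" are decided, and the spectrum shown is made up for illustration.

```python
def eigengap_k(eigenvalues):
    """Return the k that maximizes the gap between the k-th and (k+1)-th eigenvalue."""
    lam = sorted(eigenvalues)
    gaps = [lam[i + 1] - lam[i] for i in range(len(lam) - 1)]
    # gaps[i] separates the i + 1 smallest eigenvalues from the remaining ones.
    return max(range(len(gaps)), key=gaps.__getitem__) + 1

# Hypothetical Laplacian spectrum: three small eigenvalues, then a jump.
spectrum = [0.0, 0.02, 0.05, 0.9, 1.1, 1.3]
k = eigengap_k(spectrum)
```

With fewer than two eigenvalues there is no gap to inspect, and the function raises an error rather than guessing.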

Graph cut clustering
Given a notion of distance on the data, the adjacency graph or matrix records the pairwise similarity. Clustering the data X into k clusters is equivalent to providing a k-cut of the adjacency graph G. Graph cut strategies vary depending upon application. For example, the min cut algorithm minimizes the total edge weight between components of the graph, but this can result in an undesirable clustering, e.g., a cluster with one element.
The ratio cut algorithm is a graph cut that seeks to ameliorate this issue by incorporating the size of each component. Concretely, for $I \subset \{1, 2, \dots, n\}$, let $I^c$ denote the complement of $I$ and let $W(I) := \sum_{i \in I,\, j \in I^c} w_{i,j}$. Given disjoint subsets $I_1, I_2, \dots, I_k$ such that $\bigcup_{j=1}^{k} I_j = \{1, 2, \dots, n\}$, the ratio cut is defined, up to a constant factor that does not change the minimizer, as
$$\mathrm{RatioCut}(I_1, \dots, I_k) = \sum_{j=1}^{k} \frac{W(I_j)}{|I_j|}. \tag{1}$$
Finding $I_1, \dots, I_k$ such that Equation 1 is minimized is NP-hard [16]. However, a solution to a relaxed ratio cut problem can be obtained using spectral clustering [15].
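The difference between the min cut and the ratio cut can be seen on a four-vertex path graph. The sketch below evaluates the ratio cut as the sum of cut weight over component size; some references include an extra factor of one half, which does not change which partition is preferred. The weights and function names are illustrative assumptions.

```python
def cut_weight(W, subset):
    """W(I): total weight of edges between `subset` and its complement."""
    inside = set(subset)
    return sum(W[i][j] for i in inside for j in range(len(W)) if j not in inside)

def ratio_cut(W, parts):
    """Sum over components of cut weight divided by component size."""
    return sum(cut_weight(W, part) / len(part) for part in parts)

# Path graph 0 - 1 - 2 - 3 with edge weights 0.9, 1.0, and 2.0.
W = [[0.0, 0.9, 0.0, 0.0],
     [0.9, 0.0, 1.0, 0.0],
     [0.0, 1.0, 0.0, 2.0],
     [0.0, 0.0, 2.0, 0.0]]

singleton = [[0], [1, 2, 3]]   # cheapest cut, but isolates one vertex
balanced = [[0, 1], [2, 3]]    # slightly more expensive cut, equal sizes

singleton_score = ratio_cut(W, singleton)
balanced_score = ratio_cut(W, balanced)
```

Here the min cut criterion prefers the singleton partition (cut weight 0.9 versus 1.0), while the ratio cut prefers the balanced one because the cut weight is divided by the component sizes.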
The MIER algorithm will require a graph cut of a particular adjacency matrix formed by a large ensemble of coarse-grain clusterings. As discussed in Section 3.2, we perform a ratio cut on the adjacency graph formed using normalized mutual information. Consequently, to implement a ratio cut in the MIER algorithm, we will use spectral clustering on this adjacency matrix.

Mutual Information
Mutual information provides a method to quantify the shared information. Here, we outline how the mutual information is computed. For a more detailed account of mutual information and clustering, see [2] and [14].
Let $X = \{x_1, \dots, x_n\}$ be a collection of data points. Suppose that we partition the data X into two clusterings $U = \{U_i\}_i$ and $V = \{V_j\}_j$. The entropy of the clustering U, denoted H(U), is the average amount of information (e.g., in bits) needed to encode the cluster label for each data point of U. If the clustering V is known, U can be encoded with fewer bits of information. The conditional entropy H(U|V) denotes the average amount of information needed to encode U if V is known.
The mutual information I(U, V) measures how knowledge of one clustering reduces our uncertainty of the other. Formally,
$$I(U, V) = H(U) - H(U \mid V).$$
Explicit formulas for H(U) and H(U|V) can be derived as follows. Let $n_{i,j} = |U_i \cap V_j|$ denote the number of points common to $U_i$ and $V_j$, let $a_i = |U_i|$ be the size of $U_i$, and let $b_j = |V_j|$ be the size of $V_j$. Assume points of X are sampled uniformly. Then the probability that a random point $x \in X$ is in cluster $U_i$ is $a_i / n$. Moreover, the probability that a random point lies in both $U_i$ and $V_j$ is $n_{i,j} / n$. Therefore, it follows that
$$H(U) = -\sum_{i} \frac{a_i}{n} \log \frac{a_i}{n}, \qquad H(U \mid V) = -\sum_{i} \sum_{j} \frac{n_{i,j}}{n} \log \frac{n_{i,j}/n}{b_j/n},$$
which yields
$$I(U, V) = \sum_{i} \sum_{j} \frac{n_{i,j}}{n} \log \frac{n_{i,j}/n}{a_i b_j / n^2}.$$
Notice that $I(U, V) \ge 0$ and $I(U, V) = I(V, U)$. It then follows that
$$I(U, V) \le \min\{H(U), H(V)\} \le \frac{1}{2}\big(H(U) + H(V)\big). \tag{2}$$
Therefore, one can normalize the mutual information to take on values in [0, 1]. Equation 2 shows there is more than one way to do this: for example, one can divide the mutual information either by the minimum or by the average of the entropies. Each choice has its own benefits and downsides [14]. Throughout, we normalize using the average value, and therefore define the normalized mutual information as
$$NI(U, V) = \frac{2\, I(U, V)}{H(U) + H(V)}.$$
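The quantities above can be computed directly from two label vectors over the same data points. The following is a minimal sketch using natural logarithms and the average-entropy normalization; the function names are hypothetical, and returning 1.0 when both clusterings consist of a single cluster is a convention assumed here, not one stated in the text.

```python
import math
from collections import Counter

def entropy(labels):
    """H(U) computed from cluster sizes a_i."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def mutual_information(u, v):
    """I(U, V) computed from the contingency counts n_ij."""
    n = len(u)
    a, b, joint = Counter(u), Counter(v), Counter(zip(u, v))
    return sum((nij / n) * math.log((nij / n) / ((a[i] / n) * (b[j] / n)))
               for (i, j), nij in joint.items())

def normalized_mutual_information(u, v):
    """NI(U, V): mutual information divided by the average of the two entropies."""
    denom = entropy(u) + entropy(v)
    if denom == 0.0:  # assumed convention: two trivial clusterings are identical
        return 1.0
    return 2.0 * mutual_information(u, v) / denom

u = [0, 0, 1, 1]
v = [1, 1, 0, 0]   # the same partition as u with the labels swapped
w = [0, 1, 0, 1]   # a partition that is independent of u

same = normalized_mutual_information(u, v)
independent = normalized_mutual_information(u, w)
```

The normalized value does not depend on how the clusters are named, only on how the two partitions overlap, which is why it is suitable for comparing clusterings produced by separate runs.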

The CGC and MIER algorithms
Here, we present our wavelet-based clustering model for classifying slices of tensor data. We detail the clustering algorithm Coarse-Grain Clustering (CGC) and present a method, based on the mutual information, for selecting clusterings to include in an ensemble, which we call Mutual Information Ensemble Reduce (MIER).

Coarse-Grain Clustering (CGC)
This manuscript considers 4-way climate data tensors $X \in \mathbb{R}^{N_1 \times N_2 \times N_3 \times N_4}$. We will index the modes of the tensor using subscripts, namely $X = (x_{i_1, i_2, i_3, i_4})$ with $1 \le i_j \le N_j$ for $j = 1, \dots, 4$. Each of the coordinates $i_1, \dots, i_4$ describes a feature of the abstract dataset X. Correspondingly, we will always make the following physical identifications: the first and second indices $i_1$ and $i_2$ refer to latitude and longitude coordinates, respectively; the index $i_3$ denotes time; and $i_4$ refers to a state variable (e.g., temperature or precipitation).
The goal of this work is to provide meaningful clusterings for the spatial location, namely the coordinates corresponding to $i_1$ and $i_2$. Hence, we seek clusterings of the indices $(i_1, i_2) \in \{1, 2, \dots, N_1\} \times \{1, 2, \dots, N_2\}$ using the data X.
While our focus is on clustering two indices of 4-way tensors, we note that this method does generalize to clustering d-way tensors along any number of indices.
We now describe the Coarse-Grain Clustering algorithm. Figure 3 schematically displays the key features of CGC, while Algorithm 1 contains the pseudo-code.
Step One -Split Tensor: The first step in the coarse-grain clustering (CGC) algorithm is to separate the tensor X into sub-tensors that are largely statistically uncorrelated across the dataset. For example, temperature and precipitation are locally correlated (e.g., seasonal rainfall). However, they are weakly correlated at large spatial scales. Indeed, there are hot dry deserts, cold dry deserts, wet cold regions, and wet hot regions. Therefore, in the climate dataset X, one would separate by climate variables, but not by space or time. In a generic, non-climate-specific tensor, one might split across different variables or runs of an experiment. We let $X_1, X_2, \dots, X_{N_4}$ be the 3-way tensors obtained by fixing the $i_4$ index to each of its $N_4$ possible values. Note that each of these tensors $X_l$, for $l = 1, \dots, N_4$, has the same size.
Step Two -DWT: After splitting the tensor X into sub-tensors, the next step is to select the inputs. The user chooses wavelets for each of the remaining indices $i_1$, $i_2$, and $i_3$. We let $w_j$ denote the wavelet for index $i_j$, j = 1, 2, 3. Non-negative integers $\ell_j$ for j = 1, 2, 3 are selected to control the level of the DWT on index $i_j$. For each 3-way climate variable tensor $X_l$, take the DWT.
Step Three -Stack: Since the same wavelets are used on each X l , the DWT of X l will each have the same shape. These tensors can therefore be stacked along the face we wish to classify. For the climate biome problem, this would be the (i 1 , i 2 ) face.
Step Four -Vectorize: Once the approximation coefficients are stacked, they may be vectorized along the face of interest. These vectors will be clustered according to a clustering algorithm of choice. This will result in a clustering of the face of interest on the DWT stack.
Step Five -Clustering: The final input is the choice of clustering algorithm, as well as any hyper-parameters required for the chosen algorithm. For example, one may choose K-means, in which case the user needs to specify the number of clusters k. Let C denote the chosen clustering algorithm, along with its chosen hyper-parameters. With the inputs chosen, the algorithm proceeds as follows. Algorithm C is applied to the vectorized DWT coefficients from step four.
Step Six -Return Labels: The final step is to translate these labels on the coarse-grain stack to the face of the original data set. This is done using the inverse DWT. Specifically, cluster labels corresponding with the largest value appearing in the inverse DWT filter are used to propagate the coarse label to finer detail.
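The six steps above can be sketched end to end on a toy tensor. The code below is only a sketch under several assumptions that are not in the original: it uses the Haar wavelet on every axis, takes exactly one spatial level, replaces the clustering algorithm with a trivial stand-in that groups identical vectors, and propagates labels by copying each coarse label to the fine cells of its block (for Haar, the reconstruction filter weights every cell of a block equally). All function names and data are hypothetical.

```python
import math

SQRT2 = math.sqrt(2.0)

def coarsen_time(series, levels):
    """Haar approximation coefficients of one time series after `levels` transforms."""
    for _ in range(levels):
        series = [(series[t] + series[t + 1]) / SQRT2
                  for t in range(0, len(series) - 1, 2)]
    return series

def coarsen_space(grid):
    """One Haar level along both spatial axes; grid[i][j] is a list of coefficients."""
    return [[[(a + b + c + d) / 2.0
              for a, b, c, d in zip(grid[i][j], grid[i][j + 1],
                                    grid[i + 1][j], grid[i + 1][j + 1])]
             for j in range(0, len(grid[0]) - 1, 2)]
            for i in range(0, len(grid) - 1, 2)]

def cgc_features(variables, time_levels):
    """Steps two to four: transform each variable, stack, vectorize per coarse cell."""
    coarse = []
    for grid in variables:  # step one has already split the tensor by variable
        in_time = [[coarsen_time(series, time_levels) for series in row] for row in grid]
        coarse.append(coarsen_space(in_time))
    rows, cols = len(coarse[0]), len(coarse[0][0])
    # Concatenate the coefficients of all variables for each coarse spatial cell.
    return [[sum((g[i][j] for g in coarse), []) for j in range(cols)]
            for i in range(rows)]

def toy_cluster(vectors):
    """Stand-in for the clustering algorithm: identical (rounded) vectors share a label."""
    ids = {}
    return [ids.setdefault(tuple(round(x, 6) for x in v), len(ids)) for v in vectors]

def upsample_labels(coarse_labels, factor=2):
    """Step six, Haar case: each fine cell inherits the label of its coarse block."""
    return [[coarse_labels[i // factor][j // factor]
             for j in range(len(coarse_labels[0]) * factor)]
            for i in range(len(coarse_labels) * factor)]

# Toy data on a 2 x 4 grid with 4 time steps: cool and wet west, warm and dry east.
temperature = [[[1.0] * 4, [1.0] * 4, [5.0] * 4, [5.0] * 4] for _ in range(2)]
precipitation = [[[2.0] * 4, [2.0] * 4, [0.0] * 4, [0.0] * 4] for _ in range(2)]

features = cgc_features([temperature, precipitation], time_levels=1)
flat = [vector for row in features for vector in row]
labels = toy_cluster(flat)                                   # step five
n_cols = len(features[0])
coarse_labels = [labels[r * n_cols:(r + 1) * n_cols] for r in range(len(features))]
fine_labels = upsample_labels(coarse_labels)                 # step six
```

In practice the stand-in would be replaced by K-means or another algorithm of choice, and wavelets other than Haar would need a label propagation rule that accounts for their overlapping filters.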

Mutual Information Ensemble Reduce (MIER)
The CGC algorithm describes how to produce a single clustering at a fixed coarse-graining. This coarse-graining arises from the choice of wavelets and wavelet levels $\{w_j, \ell_j\}_{j=1}^{3}$. The power behind CGC is its ability to produce many clusterings by simply varying the wavelet levels $\ell_j$, j = 1, 2, 3, which can be readily parallelized via a single instruction.
This process results in an ensemble of clusters, one that is potentially too big to analyze. In this section, we discuss a method to select a small subset of this large ensemble of coarse-grainings. Our method leverages the mutual information to find a compact subset of clusters that contains most of the information across the large ensemble. This is accomplished by computing the mutual information between all the clusters in the large ensemble. This results in a connected graph. This connected graph is then ratio-cut to find heavily connected and therefore information theoretically similar clusters. For each component, we again use mutual information to select a single representative of the component. We call this method Mutual Information Ensemble Reduce (MIER).
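The final selection described above, choosing one representative per component, can be sketched as follows. The sketch assumes the pairwise normalized mutual information matrix and the components of the graph cut are already available; the matrix values, the component assignment, and the function names are all made up for illustration, and treating a one-member component as its own representative is an assumption.

```python
def average_nmi(nmi, component, member):
    """Mean normalized mutual information between `member` and the rest of its component."""
    others = [m for m in component if m != member]
    if not others:  # assumption: a lone clustering represents itself
        return 1.0
    return sum(nmi[member][m] for m in others) / len(others)

def choose_representatives(nmi, components):
    """For each component, pick the clustering with the largest average NMI."""
    return [max(component, key=lambda m: average_nmi(nmi, component, m))
            for component in components]

# Hypothetical NMI matrix for six clusterings (symmetric, ones on the diagonal).
nmi = [[1.0, 0.9, 0.8, 0.2, 0.1, 0.1],
       [0.9, 1.0, 0.7, 0.2, 0.2, 0.1],
       [0.8, 0.7, 1.0, 0.3, 0.1, 0.2],
       [0.2, 0.2, 0.3, 1.0, 0.6, 0.5],
       [0.1, 0.2, 0.1, 0.6, 1.0, 0.9],
       [0.1, 0.1, 0.2, 0.5, 0.9, 1.0]]

# Hypothetical result of the graph cut: two components of three clusterings each.
components = [[0, 1, 2], [3, 4, 5]]
representatives = choose_representatives(nmi, components)
```

In this toy case the six clusterings are reduced to two, namely the members that share the most information with the rest of their own component.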
Given a cluster U from the large ensemble, one can look at which component it belongs to in the graph cut. By construction of the MIER algorithm, the chosen representative contains a large amount of the information contained in U. The MIER algorithm is summarized in Figure 4 and Algorithm 2. The details of the algorithm are as follows.
Figure 4: Diagram of each step of the MIER algorithm. In step (2), thicker lines correspond to larger mutual information. In step (5), the clustering with largest A(U) is highlighted.
Step One -Large Ensemble: Let $\mathcal{L} \subset \mathbb{N}^3$ denote the permissible set of wavelet resolutions $(\ell_1, \ell_2, \ell_3)$ for the chosen wavelets $\{w_j\}_{j=1}^{3}$. Reasonable values for $\mathcal{L}$ can be deduced from the dataset and problem of interest, e.g., the scale of the data and the anticipated importance of embedded features. Once $\mathcal{L}$ has been decided, CGC is run for each $\ell = (\ell_j)_{j=1}^{3} \in \mathcal{L}$. We denote the clustering obtained using the wavelet resolutions $\ell$ by $U_\ell$. This results in an ensemble of clusterings $\{U_\ell\}_{\ell \in \mathcal{L}}$.
Step Two -Mutual Information: Next, we compute the normalized mutual information between each pair of clusterings in our ensemble. This results in a complete weighted graph G on nodes indexed by the set $\mathcal{L}$. The weight between node $\ell = (\ell_1, \ell_2, \ell_3)$ and node $\ell' = (\ell'_1, \ell'_2, \ell'_3)$ is the normalized mutual information $NI(U_\ell, U_{\ell'})$. We call the graph G the mutual information graph, and let W denote the weighted adjacency matrix for G.
Step Three -Graph Cut: Having built the mutual information graph, we now perform a graph cut. Recall that spectral clustering solves a relaxed version of the ratio cut problem. Hence, we use spectral clustering on W to find a ratio cut of G. The eigen-gap heuristic is used when selecting the number of clusters k for spectral clustering of W [15]. Let $\mathcal{L}_1, \mathcal{L}_2, \dots, \mathcal{L}_k$ denote the k components of $\mathcal{L}$ corresponding to the k-cut of G.
Step Four -Average N I: For each component of the cut mutual information graph, we seek a best representative. Let $A(U_\ell)$ denote the average mutual information between $U_\ell$ and all other members of its component. That is, for $\ell \in \mathcal{L}_j$,
$$A(U_\ell) = \frac{1}{|\mathcal{L}_j| - 1} \sum_{\ell' \in \mathcal{L}_j,\ \ell' \neq \ell} NI(U_\ell, U_{\ell'}),$$
where $NI(U_\ell, U_{\ell'})$ is the normalized mutual information between the clusterings $U_\ell$ and $U_{\ell'}$.
Step Five -Choose Representative: For each j = 1, . . . , k, the goal is to select the clustering $U_\ell$ that best represents all the clusterings in $\mathcal{L}_j$. If $U_\ell$ is a good representative for all the other clusterings within its component, then the mutual information between $U_\ell$ and the other members of the component will be high on average. Thus, $A(U_\ell)$ will be large. Consequently, we select a clustering in $\mathcal{L}_j$ for which A is maximized:
$$\operatorname*{arg\,max}_{\ell \in \mathcal{L}_j} A(U_\ell).$$
As a proof of concept, we apply the CGC and MIER algorithms to a gridded historical climate data set of North America [8], referred to hereafter as L15. This data set ingests station data and interpolates results for each grid point, integrating the effects of topography on local weather patterns. The grid cells are six by six kilometers, and the data consists of 614 latitudinal, 928 longitudinal, and 768 temporal steps for the years 1950-2013. The available monthly variables in the L15 data set are averaged values of daily total precipitation, daily maximum temperature, daily minimum temperature, and daily average wind speed. A representative snapshot of precipitation, maximum temperature, and minimum temperature is shown in Figure 5. The dataset contains key inputs needed for biome classification using the KG model [7] and allows ready comparison against this expert-judgment-based approach. As this dataset is freely available, as well as widely used within the climate community (e.g., Henn et al. 2017, cited over 130 times), it provides a good benchmark application to illustrate capabilities of the method, especially in comparison against more typical expert judgment approaches like the KG model.
Figure 5: Representative variables within the L15 dataset, illustrating the broad range of both coarse and fine spatial scales for precipitation, maximum temperature, and minimum temperature.

CGC Hyperparameter Selection for L15
The first step of CGC is to split the tensor X into sub-tensors corresponding to the climate variables. The historical precedent has been to use temperature and precipitation data to prescribe the biomes [7,10,13,17]. Hence, we will only consider the sub-tensors X 1 , X 2 , X 3 corresponding to precipitation, maximum temperature, and minimum temperature. The next step is to determine the inputs to Algorithms 1 and 2. We describe these now.
L15 is a gridded observational dataset with a six km spatial resolution, and each time slice of the data represents one month. Whenever a wavelet transform is taken, the spatial and/or temporal scales are approximately doubled. Thus, the coarse wavelet coefficients have spatial resolutions of 12, 24, and 48 km for one, two, and three wavelet transforms, respectively, and so on for further transforms. Similarly, wavelet transforms of the monthly time scales result in 2, 4, 8, etc., month-long scales.
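The doubling rule can be written down in a couple of lines. This small sketch treats the doubling as exact, which is a simplification of the approximate doubling described above, and the helper name is hypothetical.

```python
def coarse_scales(base_scale, levels):
    """Scale represented by the approximation coefficients after 1..levels transforms."""
    return [base_scale * 2 ** level for level in range(1, levels + 1)]

spatial_km = coarse_scales(6, 3)        # [12, 24, 48]
temporal_months = coarse_scales(1, 3)   # [2, 4, 8]
```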
There is a scale at which both the spatial and temporal information is too coarse and begins to lose meaning. For example, on one extreme the spatial scale of the entire dataset is meaningless. On the other, the six km initial resolution is too fine scale for adequate characterizations into distinctly visible biomes at the North American scale.
These scales demarcate our set of permissible wavelet resolutions $\mathcal{L}$. At least one wavelet transform is taken in both space and time. The maximum for the spatial levels $\ell_1, \ell_2$ is four (roughly 96 km). The maximum number of temporal wavelet transforms is six (roughly 5 years). Further, we opted for parsimony with regard to the spatial wavelet transforms: a wavelet transform is taken along $i_1$ (latitude) if and only if it is also taken along $i_2$ (longitude). For example, if we take two wavelet transforms in space latitudinally, we will also take two in space longitudinally so that horizontal spatial resolution is uniformly scaled. Thus, we have
$$\mathcal{L} = \{(\ell, \ell, \ell_3) : 1 \le \ell \le 4,\ 1 \le \ell_3 \le 6\}. \tag{3}$$
Note that while it was possible to push the maximum levels to a coarser grain, we wanted to avoid the risk of over-coarsening the result. For our choice of wavelets, we choose Daubechies 2 (db2) for time and Haar for space, corresponding to anticipated smooth periodicities in time and sharp gradients in space, e.g., near mountains.
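Under the constraints just described (equal latitudinal and longitudinal levels from one to four, temporal levels from one to six), the permissible set can be enumerated explicitly. The constant and variable names below are illustrative assumptions.

```python
from itertools import product

MAX_SPATIAL_LEVEL = 4    # coarsest spatial scale: 6 km * 2**4 = 96 km
MAX_TEMPORAL_LEVEL = 6   # coarsest temporal scale: 2**6 = 64 months, roughly 5 years

# Each resolution is (latitude level, longitude level, time level),
# with the two spatial levels tied together.
resolutions = [(s, s, t)
               for s, t in product(range(1, MAX_SPATIAL_LEVEL + 1),
                                   range(1, MAX_TEMPORAL_LEVEL + 1))]

ensemble_size = len(resolutions)          # 24 coarse-grainings in total
coarsest_km = 6 * 2 ** MAX_SPATIAL_LEVEL  # 96
coarsest_months = 2 ** MAX_TEMPORAL_LEVEL # 64
```

Tying the two spatial levels together keeps the ensemble at 24 members; letting them vary independently would give 96.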
For the clustering algorithm C, we have chosen to use K-means for various values of k, due to the historical precedent this algorithm has in clustering for climate applications [10,17] and its straightforward implementation. Recall that the aim is not to find the "best clustering" of our data; instead, we wish to understand how coarse-graining affects clustering and how it can be used to develop an ensemble of clusterings for understanding the sensitivity of clustering methods to latent data scales.

L15 CGC algorithm results
Application of the CGC algorithm to the L15 dataset results in spatial mappings of unique, non-overlapping classifications. Both the resolution $(\ell_1, \ell_2, \ell_3)$ and the cluster number k affect the resultant clustering. Figure 6 explores sensitivity to the wavelet transform for a fixed value of k = 10. Several coherent features are observed. First, strong latitudinal dependence in the eastern portion of the US is consistent across clusterings as classified scales are modified, e.g., Figure 6a to 6d. Second, reduced spatial scale, e.g., in Figure 6b as compared to Figure 6a, results in a loss of high spatial frequencies in the produced classification. Sensitivity to temporal scale, similar to spatial scale, produces large scale structural change but with higher spatial fidelity between classified regions, e.g., Figure 6a vs Coherent structures, such as the Rocky Mountains, are evident across all k, illustrating the overall consistency and reduced sensitivity of CGC to k as opposed to the choice of wavelet parameter set $\mathcal{L}$.

L15 MIER algorithm results
For different k values, the CGC algorithm was run across all the resolutions $\mathcal{L}$ as in Equation 3. For each fixed k, the MIER algorithm was applied to the outputs to discover the reduced ensemble. Figure 8 shows the aspects of the MIER algorithm for k = 10. Figure 8a displays the value of $A(U_\ell)$ for each resolution $\ell \in \mathcal{L}$. The value on the vertical axis denotes the number of spatial wavelet transforms, while the horizontal axis displays the number of temporal wavelet transforms. Figure 8b shows the results of the ratio cut algorithm. The key resolutions found by running the MIER algorithm are highlighted in a darker shade.
The clusters plotted in Figure 8c through Figure 8f are the best representative clusterings found in the MIER algorithm. Each clustering encapsulates different observed features from the large ensemble of clusterings. For example, decreased temporal scale increases resolution from two to three eastern US classifications, as shown in Figure 8c versus 8d and 8f. Coarsened classifications are observed as a direct result of spatial scale, e.g., Figure 8f versus 8c to 8e. Cluster boundary shape is also affected by the wavelet resolution. For example, a vertical boundary can be found in the middle of the United States across each classification. However, the shape of that boundary depends on the resolution, e.g., Figure 8d versus 8e.

Discussion
CGC resolution dependence plots in Figure 6 highlight the variability that data resolution introduces into the clustering process. As expected, increasing the number of spatial wavelet transforms results in a coarser clustering. High variance regions, such as the Rocky Mountains, become less resolved as the number of spatial wavelet transforms increases. Large structural features such as the Great Plains are persistent across the spatial wavelet coarse-graining. What is more unexpected is the effect that coarse-graining time has on the clustering. High variability regions remain high variability; however, distinctly different clustering patterns do begin to emerge. For instance, how CGC clusters the Northern Rocky Mountains does seem to depend on the temporal resolution selected, which points to the role of high-altitude storms in the resultant biome classification. Low variability regions also depend heavily on the temporal scale. For example, the Northeastern U.S. splits into more biomes as the temporal scale becomes coarser, illustrating that high frequencies may appear as noise until reduced by the wavelet and may consequently mask a signal appropriate to more specifically classify a region.
The MIER algorithm massively reduces the size of the large ensemble. In all the experiments run, the size of $\mathcal{L}$ was 24, but the reduced ensemble size is between three and five, with the majority of the cases being four. This illustrates the success of the method in identifying a characteristic, reduced set of clusterings. Furthermore, the algorithm is successful at picking resolutions that are sufficiently spaced apart. Consequently, the chosen clusterings accurately represent the dynamical range of all 24 clusterings in the large ensemble. The reduction from 24 clusterings to at most five greatly aids analysis and human comprehensibility of the output.
This can be seen by comparing Figures 6 and 8. The six sample plots in Figure 6 are the extreme cases (lowest and highest coarse-graining) as well as some middle cases. By looking at Figure 8 we see that, for instance, the clustering $U_{(1,1,1)}$ belongs to component 0. The representative for component 0 is the clustering $U_{(2,2,1)}$. There are many visual similarities between $U_{(1,1,1)}$ in Figure 6 and $U_{(2,2,1)}$ in Figure 8. Indeed, $U_{(2,2,1)}$ appears to be a blend between $U_{(1,1,1)}$ and $U_{(4,4,1)}$, which is another clustering that belongs to the same connected component.
As can be seen from the output of the MIER algorithm, the reduced ensemble can succinctly represent differences across the spatio-temporal resolutions. Most of the variance seen between the clusterings at different resolutions is captured within this subset. From a numerical standpoint, the reduced ensemble is robust as well. As can be seen from Figure 8, the expected normalized mutual information between any representative and the other clusterings in its component of the graph is usually rather large. Thus, MIER has successfully found a small, representative subset of the large ensemble.

Conclusion
We have shown that the scale of the data is a non-negligible feature with regard to clustering. Consequently, in addition to running several clustering algorithms, it is also important to include several coarse-grain clusterings in the cluster ensemble. To avoid ballooning the size of the ensemble, it is crucial not to consider every possible coarse-graining, but rather a small subset that largely represents every possible resolution. The MIER algorithm has been shown to be a good method for pruning the CGC ensemble. This capability to produce an ensemble of classifications representing the diversity of scales provides a direct pathway to better understand clustering sensitivities, illustrating a continued need to assess and mitigate uncertainties resulting from hyperparameter selection.
Introducing the ensemble of clusterings from the CGC and MIER algorithms comes at the cost of complexity. It is more difficult to analyze a set of clusterings than a single clustering. As shown in Figure 1, the additional clusterings from the CGC and MIER framework should be imported into a consensus clustering algorithm. However, as discussed in the introduction, clustering is an ill-posed problem without a single optimal solution. Further study is needed to assess the confidence across the cluster ensembles within this classification approach.