Uncovering High-dimensional Structures of Projections from Dimensionality Reduction Methods

Projections are conventional dimensionality reduction methods for information visualization used to transform high-dimensional data into a low-dimensional space. If the projection method restricts the output space to two dimensions, the result is a scatter plot. The goal of this scatter plot is to visualize the relative relationships between high-dimensional data points that build up distance- and density-based structures. However, the Johnson–Lindenstrauss lemma states that the two-dimensional similarities in the scatter plot cannot reliably represent high-dimensional structures. Here, a simplified emergent self-organizing map uses the projected points of such a scatter plot in combination with the dataset in order to compute the generalized U-matrix. The generalized U-matrix defines the visualization of a topographic map depicting the misrepresentations of projected points with regard to a given dimensionality reduction method and the dataset.
• The topographic map provides accurate information about the high-dimensional distance- and density-based structures of high-dimensional data if an appropriate dimensionality reduction method is selected.
• The topographic map can uncover the absence of distance-based structures.
• The topographic map reveals the number of clusters in a dataset as the number of valleys.


Specifications
This article is divided into four parts. The first describes the unsupervised artificial neural network of self-organizing maps (SOMs) in general; the second part describes the application of simplified emergent self-organizing maps (sESOM) to projection methods; and the third part describes the visualization of the topographic map with hypsometric tints based on the output of sESOM. The last part shows the application in three examples. The method is part of [30]; it is a co-submission of that article (ARTINT_103237), and the method's description originates from several sections of the Ph.D. thesis, "Projection-Based Clustering through Self-Organization and Swarm Intelligence" [27].

Emergent self-organizing map (ESOM)
The self-organizing (feature) map (SOM) was invented by [13], [14] and is a type of unsupervised neural learning algorithm. In contrast to other neural network models, a SOM consists of an ordered two-dimensional layer of neurons called units. Neurons are interconnected nerve cells in the human neocortex [25, p. 22], and the SOM approach was inspired by somatosensory maps (e.g., [9, p. 421] cites [8]; see also [12, p. 335]). There are two types of SOM algorithms: online and batch [7]. The first is stochastic, whereas the second is deterministic, which means that it yields reproducible results for a given parameter setting. However, Fort et al. have argued "that randomness could lead to better performances" [7, p. 12].
The main differences between batch-SOM [16] and online-SOM [15] lie in the updating and averaging of the input data. In batch-SOM, prototypes (see Eq. (1) below) are assigned to the data points and the influences of all associated data points are calculated simultaneously, in contrast to online-SOM, in which sequential training of the neurons is applied (as described in detail below). The batch-SOM method has been shown to produce topographic mappings of varying quality depending on the pre-defined parametrization [7], and "the representation of clusters in the data space on maps trained with batch learning is poor compared to sequential training" [21]. An important comparison between the batch-SOM approach and ant-based clustering was presented by [10] and will be elaborated upon in chapter 7. No objective function is used in online-SOM [17, p. 241], and SOM remains a reference tool for two-dimensional visualization [17, p. 244].
In one common approach to applying the SOM concept, the algorithm acts as an extension of the k-means algorithm [4] or is a partitioning method of the k-means type [20]. In such a case, only a few units are used in the SOM algorithm to represent the data [23], which results in direct clustering of the data. Here, each neuron can be considered to represent a cluster (see, for example, Cottrell [3]). Therefore, the conventional SOM algorithm is called k-means-SOM here. This SOM algorithm also has two common extensions called Heskes-SOM [11] and Cheng-SOM; these two extensions include objective functions [1] and are not discussed further in this thesis. The optimization of objective functions in general will be discussed in chapter 6, where it will be argued that it is not useful for the goal of this thesis. Chapter 7 will show that objective functions are incompatible with self-organization.
The other approach to applying SOM is to exploit its emergent phenomena through self-organization, in which case it is necessary to use a large number of neurons (> 4000) [31]. This enhancement of the online-SOM approach is called emergent SOM (ESOM). In such a case, the neurons serve as a projection of the high-dimensional input space instead of a clustering, as is the case in k-means-SOM.
Let M = {m_1, ..., m_n} be the positions of n neurons on a two-dimensional lattice (feature map) and W = {w_1, ..., w_n} the corresponding set of weights or prototypes of the n neurons; then, the SOM training algorithm constructs a non-linear and topology-preserving mapping of the input space I by finding the best matching unit (bmu) for each l ∈ I:

bmu(l) = argmin_i d(l, w_i),   (1)

where d(l, w_i) denotes a distance in the input space I between the point l and the prototype w_i.
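The best-matching-unit search of Eq. (1) can be sketched as follows; a Euclidean distance in the input space is assumed for illustration, and the function name is hypothetical:

```python
import numpy as np

def best_matching_unit(l, W):
    """Return the index of the best-matching unit for data point l (Eq. 1).

    W is an (n, d) array of prototypes; the BMU is the prototype with
    minimal (here: assumed Euclidean) distance to l in the input space.
    """
    distances = np.linalg.norm(W - l, axis=1)
    return int(np.argmin(distances))
```

For example, with prototypes W = [[0, 0], [1, 1], [5, 5]], the point [0.9, 1.2] is mapped to unit 1.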
In each step, SOM learning is achieved by modifying the prototypes (weights) in a neighborhood of the best matching unit according to Eq. (2):

w_i(t+1) = w_i(t) + h(m_bmu(l), m_i, R(t)) · (l − w_i(t)).   (2)
The cooling scheme is defined by the neighborhood function h(m_bmu(l), m_i, R), which decreases with the lattice distance between m_bmu(l) and m_i and vanishes outside the radius R; the radius R decreases until R = 1 in accordance with the definition of the maximum number of epochs. In contrast to all previously introduced projection methods, no objective function is used in the ESOM algorithm. Instead, ESOM uses the concept of self-organization (see chapter 6 for further details) to find the underlying structures in data.
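One online-SOM learning step could be sketched as below. The exact neighborhood function h of Eq. (2) is not reproduced here; a simple hard-radius neighborhood and a learning rate eta (not specified in the text) are assumed for illustration:

```python
import numpy as np

def som_update(l, W, M, R, eta=0.1):
    """One online-SOM learning step (sketch of Eq. 2).

    l   : data point, shape (d,)
    W   : prototypes, shape (n, d)
    M   : lattice positions of the n neurons, shape (n, 2)
    R   : current neighborhood radius (cooled towards 1 over the epochs)
    eta : assumed learning rate
    """
    # find the best matching unit in the input space (Eq. 1)
    bmu = int(np.argmin(np.linalg.norm(W - l, axis=1)))
    # lattice distance of every neuron to the BMU position
    lattice_dist = np.linalg.norm(M - M[bmu], axis=1)
    # assumed neighborhood: 1 inside radius R, 0 outside
    h = (lattice_dist <= R).astype(float)
    # move prototypes inside the neighborhood towards the data point
    return W + eta * h[:, None] * (l - W)
```

Only neurons whose lattice positions lie within R of the BMU are pulled towards l; as R cools to 1, the updates become increasingly local.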
The structure of a (feature) map is toroidal; i.e., the borders of the map are cyclically connected [31], which allows the problem of neurons on borders and, consequently, boundary effects to be avoided. The positions m ∈ M of the BMUs exhibit no structure in the input space [31]. The structure of the input data emerges only when a SOM visualization technique called the U-matrix is exploited [34].
Let N(j) be the eight immediate neighbors of m_j ∈ M and let w_j ∈ W be the prototype corresponding to m_j; then, the U-height u(j) is the average of all distances between the prototype w_j and the prototypes w_i of the neighbors m_i ∈ N(j) [34]:

u(j) = (1/|N(j)|) · Σ_{i: m_i ∈ N(j)} d(w_i, w_j).

The U-matrix technique is generally applicable to all projection methods and can be used to visualize both distance- and density-based structures [27], [35]. This visualization technique is a further development of the idea that the U-matrix can be applied to every projection method [33].
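The U-height computation on a toroidal lattice can be sketched as follows; the function name and the (L, C, d) prototype layout are illustrative assumptions:

```python
import numpy as np

def u_height(j_pos, lattice, W_grid):
    """U-height of the neuron at lattice position j_pos = (row, col):
    the average distance between its prototype and the prototypes of its
    eight immediate neighbors, with toroidal (cyclic) wrap-around.

    lattice : (L, C) lattice shape
    W_grid  : (L, C, d) array of prototypes arranged on the lattice
    """
    L, C = lattice
    r, c = j_pos
    w_j = W_grid[r, c]
    dists = []
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            if dr == 0 and dc == 0:
                continue
            # modulo arithmetic implements the cyclically connected borders
            w_i = W_grid[(r + dr) % L, (c + dc) % C]
            dists.append(np.linalg.norm(w_i - w_j))
    return float(np.mean(dists))
```

On a lattice of identical prototypes the U-height is zero everywhere; large U-heights mark border regions between clusters.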
In this work, the visualization technique results in a topographic 3D landscape. Here, the requirements are a heavily modified emergent self-organizing map (ESOM) algorithm and a method of high-dimensional density estimation. Contrary to [33], the process of computing the resulting topographic map is completely free of parameter dependence and is accessible simply by downloading the corresponding R package [Thrun/Ultsch, 2017b].

Simplified ESOM
To calculate a U*-matrix for any projection method, a modified ESOM algorithm is required. The first step is the computation of the correct lattice size. On the x axis, let the lattice begin at 1 and end at a maximal number denoted by Columns C (equal to the number of columns in the lattice); similarly, on the y axis, let the lattice begin at a maximal number denoted by Lines L and end at 1. The first condition (I) is that the lattice size should be as close as possible to the size of the coordinate system of the projected points. The second condition (II) is that the lattice size should be large enough for emergence in our algorithm (for details, see [31]). Solving the resulting equation, Eq. (4), yields Eq. (5). After the transformation from the projected points p ∈ O to points on a discrete lattice, the points are called the best-matching units (BMUs) bmu ∈ B ⊂ R², analogous to the case for general SOM algorithms, with f_grid : O → B, p ↦ bmu, where f_grid is surjective when conditions (I) and (II) are met.
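The transformation f_grid could be sketched as below. The exact lattice-size formulas of Eqs. (4) and (5) are not reproduced; the sketch assumes C and L are already chosen, and simply scales x coordinates to columns 1..C and y coordinates to lines L..1 (the lattice y axis is inverted):

```python
import numpy as np

def f_grid(P, C, L):
    """Map projected points P (shape (N, 2)) onto a discrete lattice,
    yielding BMU positions (sketch of the surjection f_grid : O -> B).

    Assumes the ranges of P are non-degenerate (max > min on both axes).
    """
    x, y = P[:, 0], P[:, 1]
    col = 1 + np.round((x - x.min()) / (x.max() - x.min()) * (C - 1))
    row = L - np.round((y - y.min()) / (y.max() - y.min()) * (L - 1))
    return np.stack([col, row], axis=1).astype(int)
```

Several projected points can land on the same lattice position, which is why f_grid is a surjection rather than a bijection.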
To develop the algorithm illustrated in Listing 1, the idea of [33], in which it was suggested to "apply Self-Organizing Map training without changing the best match[ing unit] assignment", was adopted. However, in contrast to [33], here the transformation f_grid is defined precisely to calculate the BMU positions, and the structure of the lattice is toroidal; i.e., the borders of the lattice are cyclically connected [31]. Projection errors such as those in Figure 1 (right), which divide the blue cluster into two parts, are ignored.

Fig. 2. Chainlink data set (right) and PCA projection. The projection suffers from local errors in two small areas around a low number of points, but the projection is unable to visualize them.
Based on the relevant symmetry considerations, the neighborhood function h is defined in Eq. (6). In sESOM, learning is achieved in each step by modifying the weights in a neighborhood according to Eq. (7). In contrast to [33], the algorithm does not require any input parameters, and the resulting visualization is not a two-dimensional gray-scale map but rather a topographic map with hypsometric tints [28]. The entire algorithm is summarized in Listing 1.
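The sESOM loop of Listing 1 might be sketched as follows. The BMU positions are fixed by f_grid; the radius is cooled stepwise from Rmax = C/6 down to 1, and the prototypes of the BMUs are reset to their high-dimensional data points after each iteration. The random initialization, the learning factor of 0.5, and the hard-radius neighborhood are illustrative assumptions; the precise update rule of Eqs. (6)/(7) is not reproduced:

```python
import numpy as np

def sesom(data, bmus, lattice_shape, d):
    """Sketch of the simplified ESOM loop (cf. Listing 1).

    data          : (N, d) high-dimensional data points
    bmus          : list of fixed BMU positions (col, row), 1-indexed
    lattice_shape : (L, C) toroidal lattice shape
    """
    L, C = lattice_shape
    rows, cols = np.mgrid[0:L, 0:C]
    W = np.random.default_rng(0).normal(size=(L, C, d))  # assumed init
    for R in range(max(int(C / 6), 1), 0, -1):  # Rmax = C/6 down to 1
        # reset BMU prototypes to their high-dimensional data points
        for x, (bc, br) in zip(data, bmus):
            W[br - 1, bc - 1] = x
        for x, (bc, br) in zip(data, bmus):
            # toroidal lattice distance of every neuron to this BMU
            dr = np.minimum(np.abs(rows - (br - 1)), L - np.abs(rows - (br - 1)))
            dc = np.minimum(np.abs(cols - (bc - 1)), C - np.abs(cols - (bc - 1)))
            mask = np.sqrt(dr**2 + dc**2) <= R
            # pull the neighborhood towards the data point (assumed factor)
            W[mask] += 0.5 * (x - W[mask])
    return W
```

The returned prototype grid W is the input for the U*-height computation of the next section.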

Topographic map with hypsometric tints
The U * -matrix visualization technique produces a topographic map with hypsometric tints [28] . Hypsometric tints are surface colors that represent ranges of elevation [22] . Here, a specific color scale is combined with contour lines.
The color scale is chosen to display various valleys, ridges and basins: blue colors indicate small distances (sea level), green and brown colors indicate middle distances (low hills), and shades of white colors indicate large distances (high mountains covered with snow and ice). Valleys and basins represent clusters, and the watersheds of hills and mountains represent the borders between clusters ( Fig. 2 and Fig. 6 ).
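The mapping from normalized heights to tints can be illustrated with a minimal sketch; the concrete interval borders below are illustrative assumptions, not the published color scale:

```python
def hypsometric_tint(height):
    """Assign a coarse hypsometric color class to a normalized
    U*-height in [0, 1]: blue for sea level (small distances), green
    and brown for low hills (middle distances), and white for high
    mountains covered with snow and ice (large distances).
    """
    if height < 0.25:
        return "blue"
    if height < 0.5:
        return "green"
    if height < 0.75:
        return "brown"
    return "white"
```

In the actual visualization the scale is continuous and interpolated in CIELab space, as described below, rather than a four-class lookup.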
The landscape consists of receptive fields, which correspond to certain U*-height intervals with edges delineated by contours. This work proposes the following approach (see [28, p. 10]): First, the range of U*-heights is split up into intervals, which are assigned uniformly and continuously to the color scale described above through robust normalization [18]. In the next step, the color scale is interpolated based on the corresponding CIELab color space [2]. The largest possible contiguous areas corresponding to receptive fields in the same U*-height intervals are outlined in black to form contours. Consequently, a receptive field corresponds to one color displayed in one particular location in the U*-matrix visualization within a height-dependent contour. Let u(j) denote the U*-heights, and let q01 and q99 denote the first and 99th percentiles, respectively, of the U*-heights; then, the robust normalization of the U*-heights u(j) is defined by Eq. (8):

(u(j) − q01) / (q99 − q01).   (8)

Listing 1: sESOM pseudocode. The algorithm implements a stepwise iteration from the maximum radius Rmax, which is given by the lattice size (Rmax = C/6), down to 1 in steps of one per iteration. w(m_k) indicates that the prototype of neuron m_k is modified by Eq. (7). Additionally, the search for a new best matching unit is still used, and these prototypes may change during one iteration. The predefined prototypes are reset to the weights of their corresponding high-dimensional data points after each iteration.
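The robust normalization of Eq. (8) amounts to a percentile-based rescaling; a minimal sketch (the function name is hypothetical):

```python
import numpy as np

def robust_normalize(u):
    """Robust normalization of U*-heights (Eq. 8): heights are shifted
    and scaled by the 1st and 99th percentiles, so that outliers beyond
    those percentiles cannot dominate the color scale.
    """
    q01, q99 = np.percentile(u, [1, 99])
    return (u - q01) / (q99 - q01)
```

Values between q01 and q99 land in [0, 1]; extreme outliers fall slightly outside that range instead of compressing the rest of the scale.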
The number of intervals is defined by Eq. (9). The resulting visualization consists of a hierarchy of areas of different height levels represented by corresponding colors (see Figure). To the human eye, the visualization using the generalized U-matrix tool is analogous to a topographic map; therefore, one can visually interpret the presented data structures in an intuitive manner. In contrast to other SOM visualizations, e.g., [26], this topographic map presentation enables the layman to interpret sESOM results.
The use of a toroidal map for sESOM computations necessitates a tiled landscape display in the interactive generalized U-matrix tool [29], which means that every receptive field is shown four times. Consequently, in the first step, the visualization consists of four adjoining images of the same generalized U-matrix [32] (the same is true for the U*-matrix). To obtain the 3D landscape (an island), a Shiny application can be used to cut the island out.
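The tiled display is, in effect, a 2 x 2 replication of the toroidal matrix; a minimal sketch (the function name is hypothetical):

```python
import numpy as np

def tile_umatrix(U):
    """Tile a toroidal U*-matrix 2 x 2 so that every receptive field
    appears four times; clusters cut apart by the cyclic borders then
    appear contiguously at least once in the tiled landscape.
    """
    return np.tile(U, (2, 2))
```

Cutting out an island then corresponds to selecting one contiguous region from this tiled height field.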

Summary
Restricting the output space of a projection method results in projection errors because the two-dimensional similarities in the scatter plot cannot reliably represent high-dimensional distances. This is stated by the Johnson–Lindenstrauss lemma [5] and visualized in two examples: Fig. 1 and Fig. 3 show how scatter plots can lead to misleading interpretations of the high-dimensional structures.
However, scatter plots of projection methods remain the state of the art in cluster analysis as a visualization of distance- and density-based structures (e.g., [6, pp. 31-32]; [19, p. 25]; [24, p. 223]; [9, pp. 119-120, 683-684]). Thus, the projected points of such a scatter plot are used in a simplified emergent self-organizing map in order to compute a generalized U-matrix. This generalized U-matrix defines the visualization of a topographic map which provides more accurate information about the high-dimensional distance- and density-based structures of the data. The topographic maps of the three examples are visualized in Figs. 1-6. For Fig. 6 the projections are not shown.

Declaration of Competing Interest
None.