Doubly Stochastic Neighbor Embedding on Spheres

Recently, Stochastic Neighbor Embedding (SNE) methods have widely been applied in data visualization. These methods minimize the divergence between the pairwise similarities of high- and low-dimensional data. Despite their popularity, the current SNE methods experience the"crowding problem"when the data include highly imbalanced similarities. This implies that the data points with higher total similarity tend to get crowded around the display center. To solve this problem, we normalize the similarity matrix to be doubly stochastic such that all the data points have equal total similarities. A fast normalization method is proposed. Furthermore, we show empirically and theoretically that the doubly stochasticity constraint often leads to approximately spherical embeddings. This suggests replacing a flat space with spheres as the embedding space. The spherical embedding eliminates the discrepancy between the center and the periphery in visualization and thus resolves the"crowding problem". We compared the proposed method with the state-of-the-art SNE method on three real-world datasets. The results indicate that our method is more favorable in terms of visualization quality.


Introduction
Information visualization by dimensionality reduction facilitates a viewer to quickly digest information in massive data. It is therefore increasingly applied as a critical component in scientific research, digital libraries, data mining, financial data analysis, market studies, manufacturing production control and drug discovery, etc. Numerous dimensionality reduction methods have been introduced, ranging from linear methods such as Principal Component Analysis to nonlinear methods such as Multidimensional Scaling (MDS), [MDS;14] , Isomap [13] , Locally Linear Embedding [10] , Gaussian Process Latent Variable Models [6] . A survey on nonlinear dimensionality reduction has been given by van der Maaten et al. [9] . Aspects in Multidimensional Scaling are discussed by Buja et al. [1] .
* Corresponding author at: Department of Computer Science, Norwegian University of Science and Technology, Norway.
especially for displaying clusters in data. An SNE method takes as input the pairwise similarities between data points in the highdimensional space and tries to preserve the similarities in a lowdimensional space by minimizing the Kullback-Leibler divergence between the input and output similarity matrices. The input to SNE is a similarity matrix or the affinity matrix of a weighted graph. When the node degrees of the graph are highly imbalanced, SNE tends to place the high-degree nodes in the center and the low-degree ones in the periphery, regardless of the intrinsic similarities between the nodes. Therefore, SNE often experiences the "crowding-in-the-center" problem for highly imbalanced affinity graphs.
We propose two techniques to overcome the above-mentioned drawback. First, we impose a doubly stochasticity constraint on the input similarity matrix. Two-way normalization has been shown to improve spectral clustering [16] and here we verify that it is also beneficial for data visualization. Moreover, if the neighborhood graph is asymmetric, for example, k -Nearest-Neighbors ( k NN) or entropy affinities [8,15] , we provide an efficient method for converting it to a doubly stochastic matrix.
Second, we observe that the data points are often distributed approximately around a sphere if the input similarity matrix is doubly stochastic, and we provide a theoretical analysis of this phenomenon. Our analysis suggests replacing the two-dimensional Euclidean embedding space with spheres in the three-dimensional space. Since there is no global center or periphery on the sphere geometry, the visualization is then naturally free of "crowding-inthe-center" problem. Moreover, we present an efficient projection step for adapting an SNE method with the spherical constraint.
We tested the proposed method on several real-world datasets and compared it with the state-of-the-art SNE method, t-SNE [8] .
The new method is superior to t-SNE in resolving the crowding problem and in preserving intrinsic similarities.
In the next section we briefly review SNE methods. We then discuss doubly stochastic similarity matrix and spherical embedding in Sections 3 and 4 , respectively. We present experimental results in Section 5 and conclusions in Section 6 .

Stochastic Neighbor Embedding
Stochastic Neighbor Embedding [SNE; 4] is a nonlinear dimensionality reduction method. Given a set of multivariate data points { x 1 , x 2 , . . . , x n } , where x i ∈ R D , their neighborhood is encoded in a square nonnegative matrix P , where P ij is the probability that x j is a neighbor of x i . SNE finds a mapping x i → y i ∈ R d for i = 1 , . . . , n such that the neighborhoods are approximately preserved in the mapped space. Usually the mapping is defined such that d = 2 or 3, and d < D . If the neighborhood in the mapped space is encoded in Q ∈ R n ×n such that Q ij is the probability that y j is a neighbor of y i , the SNE task is to minimize the Kullback-Leibler divergence Symmetric Stochastic Neighbor Embedding [s-SNE; 8] is a variant of SNE. Given input similarity p ij ≥ 0, s-SNE minimizes Kullback-Leibler divergence between the matrix-wise normalized similarities P i j = p i j / ab p ab and Q i j = q i j / ab q ab . The output similarity q ij is typically chosen to be proportional to a Gaussian distribution so that q i j = exp − y i − y j 2 , or proportional to a Cauchy distribution so that Here 4 j P i j (y i − y j ) or 4 j P i j (y i − y j ) q i j can be interpreted as the attractive force for y i , while −4 j Q i j (y i − y j ) or −4 j Q i j (y i − y j ) q i j as the repulsive force.

Doubly stochastic similarity matrix
The input to s-SNE, P , is a nonnegative and symmetric matrix and can be treated as the affinity matrix of an undirected weighted graph. If the degree (i.e., row sum or column sum of P ) distribution of nodes is highly non-uniform, then the high-degree nodes will usually receive and emit more attractive force than the average nodes during the iterative learning. As a result, these nodes often glue together and form the center of display. On the other hand, the low-degree nodes tend to be placed in the periphery due to less attraction. This behavior is often undesired in visualization because it only reveals the data centrality but hinders the discovery of other useful patterns, and may be directly misleading when some high-degree nodes are actually disconnected in the underlying data.
To overcome the above drawback, we can normalize the graph affinity such that the nodes have the same degree. For undirected graphs, this can be implemented by replacing the unitary matrixwise sum constraint i j P i j = 1 in s-SNE with the doubly stochasticity constraint, i.e., i P i j = j P i j = 1 .
Given a non-normalized matrix, we can apply Sinkhorn-Knopp [11] or Zass-Shashua method [16] to project it to the closest doubly stochastic matrix P . In this work we use the former because it can maintain the sparsity of in the similarity matrix, which is often needed for large-scale tasks. Given a non-normalized similarity matrix S , the Sinkhorn-Knopp method initializes P = S and iterates the following update rules until P has converged: for all i, u i ← j P ij , and then for all i, j , Alternatively, the neighborhood information in highdimensional space can be encoded in an asymmetric matrix B ≥ 0 with n rows, for example, the k NN graph or the entropy affinities [8,15] . B can also be a non-square dyadic data such as document-term or author-paper co-occurrence matrix. In these cases, we can apply the following steps to construct a doubly stochastic matrix: suppose It is easy to verify that by this construction P is symmetric and doubly stochastic. The calculations of A and P are performed only once and are thus computationally much more efficient than Sinkhorn-Knopp method which needs iterative steps. Here the matrix A ik can be treated as the random walk probability from the i th row index to the k th column index and P ij is interpreted as the two-step random walk probability between two row indices i and j via any column index k (with uniform prior over row indices).

Spherical embedding of doubly stochastic similarity matrices
When the input similarity matrix is doubly stochastic, we find that s-SNE often embeds the data points around a sphere in the low-dimensional space. The phenomenon is illustrated in Fig. 1 , where we generated a 20 0 0 × 20 0 0 similarity matrix with uniform distribution and visualize it by t-SNE. We can see from the left subfigure that the embedding is close to a ball. In contrast, if the matrix is doubly stochastically normalized (by using the Sinkhorn-Knopp method), the resulting embedded points approximately lie around a circle. The same phenomenon also holds for 3D visualizations.
We provide a theoretical analysis of this phenomenon. If P is doubly stochastic, then Q is often approximately doubly stochastic (up to a constant factor) because it approximates P by the KL-divergence. That is, j Q ij is approximately the same for all i . For example, in Fig. 1 (right), j Q ij mainly distribute around a constant (with mean 0.0 0 05 and very small standard deviation 1 . 7 × 10 −6 ). In this case, we show that j y i − y j 2 be-comes approximately the same for all i , bounded by constants, in Proposition 4.1 . Furthermore, we show that when j y i − y j    The propositions show that embeddings are often nearly spherical for doubly stochastic similarity matrices. Therefore it is more suitable to replace the 2D Euclidean embedding space with spheres in 3D space. The resulting layout can be encoded with n × 2 + 1 numbers (two angles for each data point plus the common radius). Therefore the embedding is still intrinsically two-dimensional.
The spherical geometry itself brings other benefits for visualization. First, the embedding in the Euclidean space has a global center in the middle, while on spheres there is no such global center. Therefore a spherical visualization is free of the "crowdingin-the-center" problem. Every point on the sphere can be a local center, which provides fish-eye views for navigation and for examining patterns beyond centrality. Second, the attractive and repul-sive forces can be transmitted in a cyclic manner, which helps in discovering macro patterns such inter-cluster similarities.
We thus formulate our learning objective as follows: where J (Y ) is an SNE objective function with P doubly stochastic, Q defined in Section 2 and S = Y Y ∈ R n ×3 ; y 1 = · · · = y n ; We call the new method Doubly Stochastic Neighbor Embedding on Spheres (DOSNES). It is important to notice that our formulation is more flexible than other works on spherical embeddings (e.g., [2,3,7] ). In DOSNES, the solution space S includes all centered spheres in the three-dimensional space, not only the sphere with unit or pre-fixed radius. Moreover, we do not require normalization of the input vectors. Detailed comparison with related work is given in Section 2 of the supplemental document.
We employ a projection step after each SNE update step to enforce the sphere constraint. The DOSNES algorithm steps are summarized as follows: 1. Normalize P to be doubly stochastic.
The projection step 2b is performed by implicitly switching Y = [ ˜ y 1 , . . . , ˜ y n ] T to the spherical coordinate system, taking the mean radius, and switching back to Cartesian coordinates. This is imple-mented as: For i = 1 , . . . , n where ˜ y i = ˜ y i − 1 n j ˜ y j . The iterations converge to a stationary point with suitable learning step sizes [see e.g., 5 , Section 5 ].

Experiments
We developed a browser-based software for displaying and navigating the DOSNES results. The software and its demos can be found in the project website. 1 In the paper we present the 2D projected views of the spheres. We compare our proposed method DOSNES with two-and three-dimensional t-SNE 2 as well as non-metric MDS 3 in Euclidean embedding space [8] to verify the effectiveness of using doubly stochastic similarities and the sphere constraint.
The compared methods were tested on three real-world datasets from different domains: (1) NIPS : 4 the proceedings of NIPS conferences  which contains 5993 papers and their associated 6621 authors. We used the largest connected component in the co-author graph with 5300 papers and 5422 authors. The (non-normalized) similarity matrix is from the co-author graph, i.e., BB T where B is the author-paper co-occurrence matrix.
(2) WorldTrade : 5 trade network of metal manufactures among 80 countries in 1994. Each edge represents the total trade amount (imports and exports) between two countries. MDS requires a distance matrix as input. Given a similarity matrix S , we first normalize S i j = S i j / max (S) . Treating S i j s as cosine similarities, we obtain the cosine distances by D i j = 1 − S i j . Next we calculate the shortest graph distances between all nodes and feed them to MDS.
The NIPS co-author graph is visualized in Fig. 2 . The node degrees of the graph are highly uneven, where many authors have only one paper while the most productive author has 93 papers. In Fig. 2 (a) and (b), we can see both 2D and 3D t-SNE caused the most productive NIPS authors crowded in the center. This is undesirable because these authors actually do not often co-author NIPS papers. For example, Hinton_G has no coauthored paper with Schölkopf_B but they are very close in the t-SNE layout. A similar crowding problem is observed in the MDS visualizations. In Fig. 2 (e) and (f), DOSNES resolves neatly the crowding problem, by normalizing the similarity matrix with our method in Section 3 and visualizing the authors with spherical layout. The productive NIPS authors are now more evenly distributed. For example, Hinton_G becomes more distant to Schölkopf_B . Meanwhile, retrieval around the most established authors reveals accurate co-authorship. For example, Revow_M , Nair_V and Brown_A are close to Hinton_G because all their NIPS papers are co-authored with Hinton_G . See our online demo 7 for more details.
The visualizations of the WorldTrade graph are given in Fig. 3 . In this graph, some countries such as United States and Germany have more total trade amount than many others. 2 https://lvdmaaten.github.io/tsne/ 3 We used the isoMDS() function in the MASS R package. 4 Fig. 3 (a)-(d), we can see both 2D and 3D t-SNE, as well as the MDS visualizations, caused these countries crowded in the center. In contrast, DOSNES places the countries more evenly. In Fig. 3 (e) and (f), we can see on the sphere many meaningful clusters (e.g., Europe and Asia ) which well match the geography even though we did not use such information in the training. See our demo globe 8 for other viewpoints. The effectiveness of DOSNES can be quantified by using the WorldTrade and MIREX data sets where ground truth classes are available. We performed K-means clustering on the DOSNES and t-SNE embeddings. The resulting cluster purities are reported in Table 1 (top). We can see that DOSNES achieves significantly higher purity for both data sets. We also recorded the running time of DOSNES and t-SNE for the data sets. See Table 1 (bottom). DOSNES requires almost the same time as t-SNE, which shows that DOSNES improves t-SNE at negligible additional cost.

Conclusions
We have presented a new visualization method for highdimensional and graph data. The proposed DOSNES method is based on the Stochastic Neighbor Embedding principle but with two key improvements: we normalize the input similarity matrix to be doubly stochastic and replace the 2D Euclidean embedding space with spheres in 3D space. Empirical results show that our method significantly outperforms the state-of-the-art approach t-SNE in terms of resolving the crowding problem and preserving intrinsic similarities.

Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Supplementary material
Supplementary material associated with this article can be found, in the online version, at doi: 10.1016/j.patrec.2019.08.026 .