Central Limit Theorems for Classical Multidimensional Scaling

Classical multidimensional scaling is a widely used method in dimensionality reduction and manifold learning. The method takes in a dissimilarity matrix and outputs a low-dimensional configuration matrix based on a spectral decomposition. In this paper, we present three noise models and analyze the resulting configuration matrices, or embeddings. In particular, we show that under each of the three noise models the resulting embedding gives rise to a central limit theorem. We also provide compelling simulations and real data illustrations of these central limit theorems. This perturbation analysis represents a significant advancement over previous results regarding classical multidimensional scaling behavior under randomness.


Background and Overview
Inference based on dissimilarities is of fundamental importance in statistics, data mining and machine learning [1], with applications ranging from neuroscience to psychology to economics [2].
In each of these fields, rather than directly observing the feature values of the objects, often we observe only the dissimilarities or "distances" between pairs of objects (inter-point distances). A common approach to manifold learning and subsequent inference problems involving dissimilarities is to embed the observed distances into some (usually Euclidean) space to recover a configuration that faithfully preserves observed distances, and then proceed to perform inference based on the resulting configuration. The popular Classical Multidimensional Scaling (CMDS) manifold learning method provides an example of such an embedding scheme into Euclidean space, in which we have readily available tools to perform statistical inference [3]. However, several recent papers have pointed out that there is still little known about the behavior of CMDS under noise: Fan et al. [4] write "[W]e are not aware of any statistical results measuring the performance of MDS under randomness, such as perturbation analysis when the objects are sampled from a probabilistic model."; Peterfreund and Gavish [5] write "To the best of our knowledge, the literature does not offer a systematic treatment on the influence of ambient noise on MDS embedding quality." This paper presents such an analysis.

Review of Classical Multidimensional Scaling
First we give a brief review of CMDS. Given an $n \times n$ hollow symmetric dissimilarity matrix $D = [d_{ij}]$, the fundamental goal of multidimensional scaling is to find a suitable embedding dimension $d$ and points $x_1, \ldots, x_n \in \mathbb{R}^d$ such that the inter-point distance between $x_i$ and $x_j$ is "as close as possible" to the dissimilarity given in the matrix $D$. One of the most widely used multidimensional scaling techniques, classical multidimensional scaling (CMDS), involves the following steps:
1. Compute the matrix $B = -\frac{1}{2} P D^{(2)} P$, where $D^{(2)}$ is the entry-wise square of $D$ and $P = I - \frac{1}{n}\mathbf{1}\mathbf{1}^\top$ is the double centering matrix. Here $I$ denotes the $n \times n$ identity matrix and $\mathbf{1} = (1, \ldots, 1)^\top \in \mathbb{R}^n$.
2. Let $S_B \in \mathbb{R}^{d \times d}$ be the diagonal matrix of the $d$ largest eigenvalues of $B$, and let $U_B \in \mathbb{R}^{n \times d}$ be the matrix whose orthonormal columns are the corresponding eigenvectors.
3. The matrix $X = U_B S_B^{1/2} \in \mathbb{R}^{n \times d}$ is the embedding of $D$ into $\mathbb{R}^d$.
The resulting configuration $X$ centers all points around the origin, resulting in an inherent issue of identifiability: $X$ is unique only up to an orthogonal transformation. In the following presentation, we will write $X = U_B S_B^{1/2} W$ for some orthogonal matrix $W$ when a suitably transformed $X$ is needed.
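To make the three steps concrete, here is a minimal NumPy sketch (our own illustration; the function name `cmds` and the use of `numpy.linalg.eigh` are choices of this presentation, not part of the original method description):

```python
import numpy as np

def cmds(D, d):
    """Classical MDS: embed an n x n dissimilarity matrix D into R^d (steps 1-3 above)."""
    n = D.shape[0]
    P = np.eye(n) - np.ones((n, n)) / n        # double centering matrix P = I - (1/n) 1 1^T
    B = -0.5 * P @ (D ** 2) @ P                # step 1: B = -(1/2) P D^(2) P
    evals, evecs = np.linalg.eigh(B)           # eigenvalues in ascending order
    top = np.argsort(evals)[::-1][:d]          # step 2: d largest eigenvalues/eigenvectors
    S = np.maximum(evals[top], 0.0)            # guard against floating-point round-off
    return evecs[:, top] * np.sqrt(S)          # step 3: X = U_B S_B^(1/2)
```

For a Euclidean $D$, the $d$ largest eigenvalues of $B$ are nonnegative, so the clipping above only guards against round-off.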

Noise Model and Embedding
Suppose we have the inter-point distances of $n$ points in $\mathbb{R}^d$, and that the resulting distance matrix is $D \in \mathbb{R}^{n \times n}$, i.e., $D_{ij} = \|x_i - x_j\|_2$. As before, let $D^{(2)}$ denote the entry-wise square of $D$. Let $\Delta$ be the distance matrix we observe (measured via a scientific experiment, say). A realistic error model is $\Delta = D + E$, where we can think of $D$ as "signal" and $E$ as "noise". Furthermore, the random matrix $E$ should ideally satisfy the following conditions:
(i) $E$ is symmetric, so that $\Delta$ remains a symmetric dissimilarity matrix.
(ii) $E$ is hollow; that is, $\Delta$ still has 0's on its diagonal.
(iii) The entries $E_{ij}$ (for $i < j$) are independent and $\mathrm{Var}(E_{ij}) = \sigma^2$.
(iv) Each $E_{ij}$ follows a sub-Gaussian distribution.
(v) As a technical condition, we also require that $\max_{1 \le i \le n} \sum_{j=1}^{n} D_{ij}^2 \gtrsim \log^4 n$. This is a relatively mild condition ensuring that the distance matrix is not too sparsely populated.
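As an illustration of conditions (i)-(iv), the following hypothetical helper generates $\Delta = D + E$ with symmetric, hollow, independent Gaussian noise (Gaussians being one convenient sub-Gaussian family); the name `noisy_distances` and the choice of Gaussian noise are ours:

```python
import numpy as np

def noisy_distances(D, sigma, rng=None):
    """Return Delta = D + E with E symmetric, hollow, i.i.d. N(0, sigma^2) above the diagonal."""
    rng = np.random.default_rng(rng)
    n = D.shape[0]
    E = np.triu(rng.normal(0.0, sigma, size=(n, n)), k=1)  # independent entries above the diagonal
    E = E + E.T                                            # (i) symmetric and (ii) hollow
    return D + E                                           # Gaussian entries satisfy (iii)-(iv)
```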
We then apply CMDS to $\Delta$ to obtain the resulting configuration matrix $\hat X$, using the following notation for this procedure:
1. Compute the matrix $\hat B = -\frac{1}{2} P \Delta^{(2)} P$, where $\Delta^{(2)}$ is the entry-wise square of $\Delta$.
2. Let $S_{\hat B} \in \mathbb{R}^{d \times d}$ be the diagonal matrix of the $d$ largest eigenvalues of $\hat B$, and let $U_{\hat B} \in \mathbb{R}^{n \times d}$ be the matrix whose orthonormal columns are the corresponding eigenvectors.
3. The matrix $\hat X = U_{\hat B} S_{\hat B}^{1/2} \in \mathbb{R}^{n \times d}$ is the "embedding of $\Delta$" into $\mathbb{R}^d$.
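Continuing the sketches above (and reusing the hypothetical `cmds` and `noisy_distances` helpers), one can compute both embeddings and resolve the orthogonal non-identifiability with a Procrustes alignment, which plays the role of the orthogonal matrices appearing in the results below:

```python
import numpy as np
from scipy.linalg import orthogonal_procrustes

rng = np.random.default_rng(0)
Z = rng.normal(size=(200, 2))                          # 200 latent points in R^2
D = np.linalg.norm(Z[:, None, :] - Z[None, :, :], axis=2)

X = cmds(D, d=2)                                       # clean embedding
Xhat = cmds(noisy_distances(D, sigma=0.1), d=2)        # noisy embedding
W, _ = orthogonal_procrustes(Xhat, X)                  # min_W ||Xhat W - X||_F over orthogonal W
row_errors = np.linalg.norm(Xhat @ W - X, axis=1)      # row-wise errors, the subject of the CLT
```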
A natural question arises regarding how the added noise affects the embedding configuration. That is, what is the relationship between the embedding $X$ of $D$ as in Section 1.1 and this embedding $\hat X$ of $\Delta$?

Main Results

Sub-Gaussian Random Vectors
Recall that a random variable $X$ is sub-Gaussian if $\mathbb{P}[|X| > t] \le 2e^{-t^2/K^2}$ for some constant $K$ and for all $t \ge 0$. Associated with a sub-Gaussian random variable is an Orlicz norm, defined as $\|X\|_{\psi_2} = \inf\{t > 0 : \mathbb{E}\exp(X^2/t^2) \le 2\}$. A random vector $X$ in $\mathbb{R}^n$ is called sub-Gaussian if the one-dimensional marginals $\langle X, x \rangle$ are sub-Gaussian random variables for all $x \in \mathbb{R}^n$, and the corresponding sub-Gaussian norm of $X$ is defined as $\|X\|_{\psi_2} = \sup_{\|x\|_2 = 1} \|\langle X, x \rangle\|_{\psi_2}$.
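As a concrete worked example (not taken from the paper), the $\psi_2$ norm of a standard Gaussian can be computed directly from this Orlicz definition, using the moment generating function of $X^2$:

```latex
% Worked example: the psi_2 norm of X ~ N(0,1).
% Using E[exp(lambda X^2)] = (1 - 2 lambda)^{-1/2} for lambda < 1/2, with lambda = 1/t^2:
\[
  \mathbb{E}\exp\!\big(X^2/t^2\big) = \big(1 - 2/t^2\big)^{-1/2} \le 2
  \iff 1 - \tfrac{2}{t^2} \ge \tfrac{1}{4}
  \iff t \ge \sqrt{8/3},
\]
\[
  \text{so } \|X\|_{\psi_2} = \inf\{t > 0 : \mathbb{E}\exp(X^2/t^2) \le 2\} = \sqrt{8/3} \approx 1.633.
\]
```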

Main Theorem
We have the following consistency result relating $X$ and $\hat X$. Let the noise matrix $E$ satisfy the conditions in Section 2 and let $\Delta = D + E$. Suppose the dissimilarities are generated by points $Z_1, \ldots, Z_n$ drawn i.i.d. from a distribution on $\mathbb{R}^d$, and that $X$ and $\hat X$ are the CMDS embedding configurations in $\mathbb{R}^d$ of $D$ and $\Delta$, respectively. Then there exists a sequence of orthogonal matrices $\{W_n\}_{n=1}^{\infty} \subset \mathbb{R}^{d \times d}$ such that for any $\alpha \in \mathbb{R}^d$ and any fixed row index $i$,
$$\mathbb{P}\Big[ \sqrt{n}\, \big( (\hat X W_n)_i - (Z_i - \bar Z) \big) \le \alpha \;\Big|\; Z_i = z \Big] \longrightarrow \Phi(\alpha, \Sigma(z)),$$
where $\bar Z$ is the mean of the $Z_i$'s and $\Phi(\alpha, \Sigma)$ denotes the CDF of a multivariate Gaussian with mean 0 and covariance matrix $\Sigma$, evaluated at $\alpha$. Here $\Sigma(z)$ is a covariance matrix depending on $z$, whose explicit form is derived in the Appendix. Intuitively, this theorem claims that the rows of $\hat X$, after an orthogonal transformation, will be (approximately) centered around the rows of $X$ when $n$ is large. Furthermore, the more inter-point distances we have, the more tightly the rows of $\hat X$ will concentrate around the corresponding rows of $X$. The covariance of the centering depends on the noise and on the true distribution of the points in the underlying space. We refer the reader to the Appendix for a detailed proof of this theorem.
Remark 1. Note that the entries of $\Delta$ should be nonnegative. Consider a modification of our conditions in which noise is added only to the large entries of $D$, provided that $D$ has sufficiently many large entries. That is, if $\|E_{ij}\|_{\psi_2} = \gamma$, we add noise only to those entries of $D$ for which $D_{ij} > \gamma C$ for some constant $C$. In this case, a variant of our Theorem 3.1 can be established.
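A hypothetical sketch of this modified noise model, reusing the conventions of the earlier helper (we use the Gaussian standard deviation as a stand-in for the $\psi_2$ norm $\gamma$, and $C$ is an unspecified constant):

```python
import numpy as np

def thresholded_noise(D, gamma, C, rng=None):
    """Variant of Remark 1: add noise only to the large entries of D (D_ij > gamma * C)."""
    rng = np.random.default_rng(rng)
    n = D.shape[0]
    E = np.triu(rng.normal(0.0, gamma, size=(n, n)), k=1)   # sd as a proxy for the psi_2 norm
    E = E + E.T                                             # symmetric and hollow
    return D + E * (D > gamma * C)                          # noise only where D_ij > gamma * C
```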

Three Point-mass Simulated Data
As a simple illustration of our CMDS CLT, we embed noisy Euclidean distances obtained from $n$ points into $\mathbb{R}^2$. We consider three points $x_1, x_2, x_3 \in \mathbb{R}^2$ whose inter-point distances are 3, 4, and 5 (the three points form a right triangle) and generate $n_k = \pi_k n$ points equal to $x_k$ for $k \in \{1, 2, 3\}$; the noise entries $E_{ij}$ are independent Gaussians. For each $n \in \{50, 100, 500, 1000\}$, Figure 1 compares, for one realization, the theoretical vs. estimated means and covariance matrices (95% level curves). Table 4.1 shows the empirical covariance matrix for one of the point masses, $\hat\Sigma^{(1)}$, together with its entry-wise variance estimated via 500 simulations, behaving in accordance with Theorem 3.1. Note that the blue and black centers and ellipses coincide for large $n$.
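A compact sketch of this experiment, reusing the hypothetical `cmds` and `noisy_distances` helpers; the mixture weights `pi` and the noise level `sigma` below are illustrative choices, not the paper's exact settings:

```python
import numpy as np
from scipy.linalg import orthogonal_procrustes

anchors = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 4.0]])   # inter-point distances 3, 4, 5
pi = np.array([0.2, 0.3, 0.5])                             # illustrative mixture weights
n, sigma, reps = 500, 0.1, 500
Z = anchors[np.repeat([0, 1, 2], (pi * n).astype(int))]    # n points at the three masses
D = np.linalg.norm(Z[:, None, :] - Z[None, :, :], axis=2)
X = Z - Z.mean(axis=0)                                     # centered true configuration

rows = []
for _ in range(reps):
    Xhat = cmds(noisy_distances(D, sigma), d=2)
    W, _ = orthogonal_procrustes(Xhat, X)                  # align each replicate to the truth
    rows.append((Xhat @ W)[0])                             # track one row from the first mass
Sigma1_hat = n * np.cov(np.stack(rows), rowvar=False)      # compare with Theorem 3.1's covariance
```

By Theorem 3.1, $n$ times the covariance of an aligned row should approach the corresponding limiting covariance, which is what Table 4.1 examines.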

Remark 2.
In this simulation we relax the requirement that the entries of ∆ should be nonnegative in order to illustrate the phenomenon of decreasing covariance with increasing n.

Shape clustering
As a second illustration of the effect of noise on CMDS, we examine a more involved clustering experiment in the (non-Euclidean) shape space of closed curves. In this experiment, we consider boundary curves obtained from silhouettes in the Kimia shape database. Specifically, we restrict attention to three predefined classes of objects (bottle, bone, and wrench) and take from each class three different example shapes, all given by planar closed polygonal curves representing the objects' outlines. Figure 2 shows one instance of each of the bottle, bone, and wrench classes.
A database of noisy curves is then created as follows: for each of the nine template shapes, we generate 100 noisy realizations in which the vertices of the curve are moved along the curve's normal vectors by random distances drawn from independent Gaussian distributions at each vertex; a code sketch of this perturbation procedure appears below. This results in a total of 900 noisy versions of the initial curves, such as the ones displayed in Figure 3.

We then compute the pairwise distance matrix between all the curves (including the noiseless templates) based on a shape distance introduced in [6] and later extended in [7]. This type of metric is based on the representation of shapes in a particular distribution space called currents; see [7] for details. In our context, this metric offers several advantages: (i) the distance is completely geometrical, in the sense that it is independent of the sampling of the curves and does not rely on predefined pointwise correspondences between vertices; (ii) it has an intrinsic smoothing effect that provides a degree of robustness to noise; (iii) it can be computed in closed form with minimal computational time, which is critical given the large number of pairwise distances to evaluate. In this setting, we can view the resulting distance matrix as a noisy perturbation of the ideal distances between the 9 template curves, which fits into the generic framework of our model. (Note that we leave aside the issue of checking the technical assumptions on the matrix E, which may be quite involved for this noise model and distance.)

We proceed to perform CMDS on this distance matrix, with the embedding dimension guided by a scree plot investigation. One immediate consequence of our Main Theorem is that when the variance of the noise E increases, the covariance matrix Σ gets larger. To further illustrate this point, we repeated the same experiment with larger-variance Gaussian noise on the curves before applying CMDS, and we observe that larger noise indeed leads to larger covariances of the clusters in the configuration space.
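A minimal sketch of the normal-direction perturbation described above (our own illustration; the tangent and normal estimates use central differences on the closed polygon):

```python
import numpy as np

def noisy_curve(V, sigma, rng=None):
    """Perturb a closed polygonal curve V (m x 2 vertices) along its normal vectors."""
    rng = np.random.default_rng(rng)
    T = np.roll(V, -1, axis=0) - np.roll(V, 1, axis=0)     # central-difference tangents
    T /= np.linalg.norm(T, axis=1, keepdims=True)
    N = np.stack([-T[:, 1], T[:, 0]], axis=1)              # rotate tangents by 90 degrees
    return V + sigma * rng.normal(size=(len(V), 1)) * N    # independent Gaussian offset per vertex
```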
While these preliminary shape clustering results are obtained with a specific and simple distance on the space of curves, future work will investigate whether similar properties hold with different, more elaborate metrics and/or geometric noise models. The central limit theorem derived here could then constitute a useful theoretical tool to evaluate the discriminating power of shape clustering methods based on CMDS.

Discussion
In [8], [9] and [10], the authors prove that adjacency spectral embedding of the random dot product graph gives rise to a central limit theorem for the rows of the latent positions. In this work we extend these results to distance matrix embedding.
We have avoided any discussion of the model selection problem of choosing a suitable embedding dimension $d$; instead, we assume $d$ is known, except in Section 4.2. There are many methods for choosing (spectral) embedding dimensions; see [11, 12, 13].
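For illustration only, here is a simple largest-gap heuristic on the scree of eigenvalues; this is not one of the methods of [11, 12, 13], which offer more principled choices:

```python
import numpy as np

def elbow_dimension(B, max_d=None):
    """Pick an embedding dimension from the scree of B's eigenvalues (largest-gap heuristic)."""
    evals = np.sort(np.linalg.eigvalsh(B))[::-1]   # eigenvalues, descending
    if max_d is not None:
        evals = evals[:max_d]
    gaps = evals[:-1] - evals[1:]                  # successive gaps in the scree plot
    return int(np.argmax(gaps)) + 1                # dimension just before the largest drop
```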
A practically relevant and conceptually illustrative example comes from relaxing the assumption that the observed dissimilarities arise from points in Euclidean space, as in the shape clustering experiment of Section 4.2.

CMDS is just one of a wide variety of multidimensional scaling techniques. Minimizing the raw stress criterion is another commonly used MDS technique [14]. This method seeks to minimize the raw stress, defined as
$$\sigma_r = \sigma_r(X) = \sum_{i < j} \big( d_{ij} - \delta(x_i, x_j) \big)^2,$$
where $d_{ij}$ is the $(i,j)$th entry of the given distance matrix $D$ and $\delta$ is a distance metric. We minimize $\sigma_r$ by an iterative algorithm which updates the configuration matrix $X$ until a stopping criterion is met [15], as sketched below. Keeping the simulation settings as in Section 4.1, the resulting configuration is shown in Figure 6. This suggests that the CLT may hold for raw stress just as well as for CMDS. However, this claim is at best a conjecture at present, as the analysis appears significantly more involved.
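A sketch of raw-stress minimization using scikit-learn's SMACOF implementation as a stand-in for the iterative algorithm of [15], reusing the hypothetical `noisy_distances` helper and the distance matrix `D` from the earlier sketches:

```python
from sklearn.manifold import MDS

mds = MDS(n_components=2, metric=True, dissimilarity="precomputed",
          n_init=4, max_iter=300, random_state=0)
X_stress = mds.fit_transform(noisy_distances(D, sigma=0.1))  # iterative stress minimization
print(mds.stress_)                                           # final value of the stress criterion
```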
Another possible, albeit naive, model for $\Delta$ is $\Delta^{(2)} = D^{(2)} + E$ (as opposed to $\Delta^{(2)} = (D + E)^{(2)}$), where $E$ still satisfies the conditions in Section 2. This is an error model on the squared distance matrix $D^{(2)}$ rather than on the distance matrix $D$ itself. Our central limit theorem still holds in this situation, with a greatly simplified covariance matrix. For the sake of completeness, we present this version below.
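A hypothetical sketch of sampling from this squared-distance model (the clipping keeps the square root well defined, echoing the nonnegativity discussion of Remark 1):

```python
import numpy as np

def noisy_squared_distances(D, sigma, rng=None):
    """Perturb the squared distances: Delta^(2) = D^(2) + E, then return Delta."""
    rng = np.random.default_rng(rng)
    n = D.shape[0]
    E = np.triu(rng.normal(0.0, sigma, size=(n, n)), k=1)
    E = E + E.T                                  # symmetric and hollow, as in Section 2
    Delta2 = D ** 2 + E                          # noise added to the squared distances
    return np.sqrt(np.maximum(Delta2, 0.0))      # clip so the square root is well defined
```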
Under this model, there exists a sequence of orthogonal matrices $\{W_n\}_{n=1}^{\infty} \subset \mathbb{R}^{d \times d}$ such that for any $\alpha \in \mathbb{R}^d$ and any fixed row index $i$, we have
$$\mathbb{P}\Big[ \sqrt{n}\, \big( (\hat X W_n)_i - (Z_i - \bar Z) \big) \le \alpha \;\Big|\; Z_i = z \Big] \longrightarrow \Phi(\alpha, \Sigma(z)),$$
where $\bar Z$ is the mean of the $Z_i$'s and $\Phi(\alpha, \Sigma)$ denotes the CDF of a multivariate Gaussian with mean 0 and covariance matrix $\Sigma$, evaluated at $\alpha$; here $\Sigma(z)$ takes a simpler form than in Theorem 3.1. The proof is essentially a simplified version of the argument in the Appendix.

Appendix

We begin with an elementary observation: for any matrix $H$, the nonzero eigenvalues of $H^\top H$ are the same as those of $H H^\top$ (indeed, if $H^\top H v = \lambda v$ with $\lambda \neq 0$, then $H H^\top (H v) = \lambda (H v)$ and $H v \neq 0$). In what follows, we remind the reader that $X$ is the matrix whose rows are the transposes of the column vectors $X_i$, and that $Y$ is a $d$-dimensional vector that is independent of, and has the same distribution as, the $X_i$. We observe that the relevant entries of the products of $X$ with the noise matrix are sums of $n$ independent mean-zero sub-Gaussian random variables; the general Hoeffding inequality [17] controls them uniformly for all $i, j \in [d]$, and the Davis-Kahan $\sin(\Theta)$ theorem [18] then bounds the eigenspace perturbation for sufficiently large $n$, using the previous two lemmas.

Recall that a random variable $X$ is sub-exponential if $\mathbb{P}[|X| > t] \le 2e^{-t/K}$ for some constant $K$ and for all $t \ge 0$. Associated with a sub-exponential random variable is an Orlicz norm, defined as $\|X\|_{\psi_1} = \inf\{t > 0 : \mathbb{E}\exp(|X|/t) \le 2\}$. Furthermore, a random variable $X$ is sub-Gaussian if and only if $X^2$ is sub-exponential, and $\|X^2\|_{\psi_1} = \|X\|_{\psi_2}^2$.
Lemma A.6. Let $W^* = W_1 W_2^\top$. Then, asymptotically almost surely, $\|W^* S_{\hat B} - S_B W^*\|_F = O(\log n)$, and an analogous bound holds with $S_{\hat B}$ and $S_B$ replaced by their square roots.

Proof sketch. Let $R$ denote the residual after projecting $U_{\hat B}$ orthogonally onto the column space of $U_B$, so the alignment error is controlled by a minimization over all orthogonal matrices $W$; a variant of the Davis-Kahan $\sin \Theta$ theorem [19] then gives the claimed bounds.

Now consider $\hat B U_{\hat B}$; here we use the fact that $U_{\hat B} S_{\hat B} = \hat B U_{\hat B}$. Writing out the resulting decomposition and letting $U_i$ denote the $i$th column of $U_B$, each $(i,j)$th entry of the leading term, displayed as equation (1), is a sum of mean-zero sub-exponential random variables, since the $X_k$'s are sub-Gaussian; Bernstein's inequality [17] then provides the required concentration.

Lemma A.7. There exists an orthogonal matrix $W$ such that $\hat X$ and $X W$ agree up to the error terms controlled above.

Proof sketch. Expand $\hat X = U_{\hat B} S_{\hat B}^{1/2}$, using the facts that $U_B U_B^\top B = B$ and $U_{\hat B} S_{\hat B}^{1/2} = \hat B U_{\hat B} S_{\hat B}^{-1/2}$.

Lemma A.8. There exists an orthogonal matrix $W$ such that, with high probability, $\|X - \hat X W\|_F$ is of the order given by the preceding lemmas.

Proof sketch. By Lemma A.7 and an application of Hoeffding's inequality similar to that in the previous lemma, each term in the expansion is controlled; combining the bounds gives the claim. We implicitly use a bound on $\|S_B^{-1/2}\|$, which can be proved completely analogously to Lemma A.6.
Theorem A.9. There exist orthogonal matrices $W_n \in \mathbb{R}^{d \times d}$ such that the maximum row-wise discrepancy between $\hat X W_n$ and $X$ vanishes asymptotically almost surely.

Proof sketch. The claim follows from the previous theorem, with high probability, by bounding terms (2) through (5) in turn. For (3), recall that $X = U_B S_B^{1/2}$. To show (4), we must bound the Euclidean norm of a residual vector, and it remains only to bound the $h$th rows of two matrices $G_1$ and $G_2$.
Let us consider the $G_1$ term. Note that $U_{\hat B}^\top U_{\hat B} = I$. Assuming row-exchangeability of $H_1$, i.e., $n \mathbb{E}\|(H_1)_h\|^2 = \|H_1\|_F^2$, Markov's inequality gives the required bound. We now recall two observations: first, the orthogonal-alignment optimization problem $\min_W \|U_{\hat B} W - U_B\|_F$ over orthogonal $W$; second, by Theorem 2 of [19], there exists an orthogonal $W \in \mathbb{R}^{d \times d}$ realizing the Davis-Kahan bound. Combining the two facts above, we conclude that $\|U_{\hat B} - U_B U_B^\top U_{\hat B}\|_F^2 \le \frac{C}{n}$ with high probability, as in Lemma A.6, which completes the proof.

The limiting covariance matrix depends on the conditioning point: since $X - \mu$ has mean 0, the corresponding correction term tends to 0 as $n \to \infty$. Conditioning on $X_i = x_i$, expression (6) is then the sum of $n - 1$ independent mean-0 random variables, each with covariance matrix determined by $\mathrm{var}\big(E_{ij}\|x_i - X_j\| + \tfrac{E_{ij}^2 - \sigma^2}{2}\big)$. Finally, by the strong law of large numbers, we have $\frac{1}{n} W_n^\top S_B W_n = \frac{1}{n} X^\top X \to \Xi \in \mathbb{R}^{d \times d}$ almost surely; hence $(n W_n^\top S_B^{-1} W_n) \to \Xi^{-1}$ almost surely, and the multivariate Slutsky theorem completes the argument.

We can now prove Theorem 3.1.

Proof. Note that $X_i = (Z_i - \bar Z) W$, where $W$ is an orthogonal matrix; that is, the $X_i$'s are just the centered $Z_i$'s after an orthogonal transformation. Substituting the $X_i$'s with the centered $Z_i$'s on the right-hand side of the equality in Lemma A.11, we get the desired result.