Random Consensus Robust PCA

This paper presents r2pca, a random consensus method for robust principal component analysis. r2pca takes ransac's principle of using as little data as possible one step further. It iteratively selects small subsets of the data to identify pieces of the principal components, and then stitches them together. We show that if the principal components are in general position and the errors are sufficiently sparse, r2pca will exactly recover the principal components with probability 1, without assumptions on coherence or the distribution of the sparse errors, even under adversarial settings. r2pca enjoys many advantages: it works well under noise, its computational complexity scales linearly in the ambient dimension, it is easily parallelizable, and due to its low sample complexity, it can be used in settings where the data is too large to even be stored in memory. We complement our theoretical findings with synthetic and real-data experiments showing that r2pca outperforms state-of-the-art methods in a broad range of settings.


Introduction
In many relevant applications one aims to find a low-dimensional subspace that approximates a large data matrix M. Principal Component Analysis (PCA) is one of the most widely used techniques for this purpose. Unfortunately, a single grossly corrupted datum can severely compromise its performance. Hence there is a wide variety of approaches to make PCA robust. Examples include M-estimators [1], random sampling [2], influence function techniques [3], alternating minimization [4] and convex relaxations [5][6][7][8][9][10]. Other approaches use convex optimization methods on subsets of the data (e.g., full rows and full columns) to improve computational complexity [11,12].
One of the most natural and widely used algorithms for robust estimation is random sample consensus (ransac) [2]. ransac is simple yet powerful. It is popular because it makes almost no assumptions about the data, and it does not require unrealistic conditions to succeed. It has theoretical guarantees, works well in practice, and has enjoyed many improvements since its inception, e.g., [13][14][15]. The ransac version of PCA iteratively selects a few columns in M to define a candidate subspace, until it identifies a subspace that agrees with other columns in M. This will successfully identify the subspace that we are looking for, as long as M has enough uncorrupted columns.
But in many modern applications, such as image processing and network data analysis, every column in M may have a few grossly corrupted entries. This makes every column in M an outlier, so standard robust methods like ransac are no longer applicable. In this setting M is better modeled as the sum of a low-rank matrix L and a sparse matrix S representing the corrupted entries. The goal is to identify the subspace U spanned by the columns of L. This problem is often called robust PCA (RPCA) [6]. The last decades have seen great advances on this problem [16], yet it remained unclear how to extend ransac's principles to this setting [3].
The main contribution of this paper is r2pca: a random consensus algorithm for RPCA. r2pca takes ransac's principle of using as little data as possible one step further. It iteratively selects small subsets of the entries in M to identify pieces of the subspace U. This process is repeated until we identify enough pieces to stitch them together and recover the whole subspace. The key idea behind r2pca is that subspaces can be easily and efficiently recovered from a few of their canonical projections [17]. These canonical projections are the pieces that r2pca aims to identify. See Figure 1 for some intuition.
Our main result shows that r2pca will exactly recover the subspace that we are looking for with probability 1, as long as M is generic and the corrupted entries are sufficiently sparse. In contrast to popular convex relaxation methods (e.g., [5][6][7][8][9][10][11][12]), our results make no assumptions about coherence or the distribution of the sparse errors. In fact, our results hold even under adversarial settings where the errors are purposely located to complicate success. In its noisy variant, r2pca can consistently estimate the desired subspace within the noise level. The computational complexity of r2pca scales linearly in the ambient dimension. In addition, r2pca enjoys many of ransac's advantages, and many of ransac's improvements can be easily adapted to r2pca. For instance, r2pca can run in parallel, with different computers searching for different pieces (canonical projections) of the subspace. This can greatly reduce computation time, which is of great interest in general, and paramount in real-time applications. Furthermore, r2pca's principle of studying subspaces by pieces also improves computational and sample complexity, because r2pca only uses small subsets of the data at a time. This is of particular importance in settings where M is so large it cannot even be stored in memory. We complement our theoretical findings with synthetic and real-data experiments showing that r2pca outperforms state-of-the-art methods, both in terms of speed and accuracy, in a broad range of settings.

Figure 1: First: Each column in a rank-r matrix L corresponds to a point in an r-dimensional subspace U. In these figures, U is a 1-dimensional subspace (line) in general position, and the columns in L are drawn generically from U, that is, independently according to an absolutely continuous distribution with respect to the Lebesgue measure on U (for example, a gaussian distribution on U). Second: Adding S amounts to corrupting some coordinates of some columns in L. In this figure each point is corrupted in only one coordinate. As a result, the corrupted points no longer lie in U. So, how can we identify U from these corrupted points? Third: The key idea behind r2pca is that since the errors are sparse, if we only focus on a few coordinates of a few columns at a time, it is likely that the selected columns are uncorrupted on the selected coordinates. We can verify whether this is the case, because the projections of the selected columns onto the selected coordinates will agree if and only if the columns are uncorrupted in these coordinates. In this illustration, r2pca only focused on the (x, y) coordinates and on two columns. The projections of both columns onto the (x, y) coordinates agree; namely, they both lie in U_ω1. Hence we can be sure that the (x, y) coordinates of these columns are uncorrupted, and that U_ω1 is actually equal to the projection onto these coordinates of the subspace U that we aim to identify. Last: We can repeat this procedure for different sets of coordinates and columns, until we obtain enough projections to reconstruct the whole subspace.

Model and Main Results
Suppose we observe a d × n data matrix M, given by

M = L + S, (1)

where L is rank-r and S is sparse. The goal is to identify the r-dimensional subspace U spanned by the columns of L, or slightly more demanding, to determine S and L. Consider the following assumptions, where Gr(r, R^d) denotes the Grassmannian manifold of r-dimensional subspaces in R^d, and ‖·‖_0 denotes the ℓ_0-norm, given by the number of nonzero entries.
(A1) U is drawn according to an absolutely continuous distribution with respect to the uniform measure over Gr(r, R d ).
(A2) The columns of L are drawn independently according to an absolutely continuous distribution with respect to the Lebesgue measure on U .
(A3) The nonzero entries in S are drawn independently according to an absolutely continuous distribution with respect to the Lebesgue measure on R^(‖S‖_0).
(A4) S has at most (n−r)/(2(r+1)) · r^(1−α) nonzero entries per row and at most (d−r)/(2(r+1)) · r^(2−α) nonzero entries per column, for some α ≥ 1.

A1 requires that U is a subspace in general position, and A2 requires that the columns in L are drawn generically from this subspace. Together, A1 and A2 require that L is a generic rank-r matrix. Similarly, A3 requires that S is a generic sparse matrix. See Section 4 for a further discussion of our assumptions and their relation to other typical assumptions from the literature.
A4 requires that M has at most O(n/r^α) corrupted entries per row and at most O(d/r^(α−1)) corrupted entries per column. Notice that since decomposing M is the same as decomposing M^T, assumption A4 can be interchanged in terms of rows and columns. A4 is a reasonable assumption because in most problems where this setting arises, S is sparse and r ≪ d, n, whence A4 allows a large number of corrupted entries in M. The parameter α determines the sparsity level of S, which in turn determines the computational complexity of r2pca. The larger α, the sparser S, and the faster r2pca will succeed.
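As a sanity check on this reading of A4 (the constants in the bound above are our reconstruction of a garbled formula, so treat them as an assumption), the row budget for the experimental setting of Section 6 (d = n = 100, r = 5, α = 1) works out to roughly 7.9% corrupted entries per row, which matches the dashed line in Figure 2:

```python
# Reconstructed A4 row budget (assumption, not verbatim from the paper):
# at most (n - r) / (2 (r + 1)) * r**(1 - alpha) corrupted entries per row.
n, r, alpha = 100, 5, 1
row_budget = (n - r) / (2 * (r + 1)) * r ** (1 - alpha)  # entries per row
print(round(100 * row_budget / n, 1))  # percent of entries per row -> 7.9
```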
The main result of this paper is the next theorem. It states that r2pca (Algorithm 1 below) will exactly recover U, L and S in O((d + n) 2^(r^(2−α))) iterations (linear in d and n). In the extreme case where S has too many errors (α = 1), r2pca could require time exponential in r. But if S is sufficiently sparse (α ≥ 2), r2pca will succeed in time linear in r. This is true even in the adversarial setting where the errors are purposely located to complicate success. In other words, the computational complexity in Theorem 1 considers the worst-case scenario, so in practice, as shown in our experiments, r2pca can be much faster and allow a much larger number of corrupted entries. The proof is given in Appendix A.
Theorem 1. Let A1-A4 hold. Let Û, L̂ and Ŝ be the output of r2pca. Then span{Û} = U, L̂ = L and Ŝ = S with probability 1. Furthermore, the expected number of iterations required by r2pca is upper bounded by O((d + n) 2^(r^(2−α))).

Remark 1. Throughout the paper we assume that the rank r is known. If this is not the case, r can be estimated by iteratively selecting (τ + 1) × (τ + 1) minors in M, and verifying their rank (or their singular value decomposition in the noisy setting). Under A1-A4, with probability 1 there will be a (τ + 1) × (τ + 1), rank-τ minor in M if and only if r = τ. So we can start with τ = 1. If we find a rank-1, 2 × 2 minor in M, we know that r = 1. If there exists no such minor, we know that r ≥ 2. We can iteratively repeat this process until we find a (τ + 1) × (τ + 1) minor of rank τ. A1-A4 imply that if r = τ, then for every ω ⊂ {1, . . ., d} with τ + 1 entries, M_ω will contain a (τ + 1) × (τ + 1), rank-τ minor. Furthermore, using the same reasoning as in the proof of Theorem 1, one can show that in expectation it would take no more than O(2^(r^(2−α))) trials to find such a minor (recall that α ≥ 1 determines the sparsity level of S).
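The search in Remark 1 is straightforward to sketch. The following is a minimal illustration (our own code, not the authors'; the numerical tolerance and trial budget are assumptions): for each candidate τ it samples random (τ + 1) × (τ + 1) minors and stops at the first τ for which a rank-τ minor appears.

```python
import numpy as np

rng = np.random.default_rng(0)

def estimate_rank(M, tol=1e-9, trials=2000):
    # Find the smallest tau such that M contains a (tau+1) x (tau+1)
    # minor of rank exactly tau (Remark 1): that tau equals r.
    d, n = M.shape
    for tau in range(1, min(d, n)):
        for _ in range(trials):
            rows = rng.choice(d, size=tau + 1, replace=False)
            cols = rng.choice(n, size=tau + 1, replace=False)
            minor = M[np.ix_(rows, cols)]
            if np.linalg.matrix_rank(minor, tol=tol) == tau:
                return tau
    return min(d, n)

# Generic rank-3 L plus a few gross errors: the estimate is still 3,
# because uncorrupted minors remain abundant.
d, n, r = 20, 20, 3
L = rng.standard_normal((d, r)) @ rng.standard_normal((r, n))
S = np.zeros((d, n))
S.flat[rng.choice(d * n, 5, replace=False)] = 10.0
print(estimate_rank(L + S))
```

For τ < r every generic minor has rank τ + 1, and for τ = r any uncorrupted minor has rank exactly r, so the loop stops at the true rank.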

Algorithm
In this section we introduce r2pca in its most basic setting (Algorithm 1). In Section 5 we discuss how to generalize it to noisy settings. From a high-level perspective, r2pca searches for small subsets of uncontaminated data in M to obtain pieces of U, to then stitch them together. Once U is known, r2pca searches for a few uncontaminated entries in each column of M to recover the coefficients of L. Once L is known, S can be trivially recovered through (1).
The key idea behind r2pca is that subspaces can be exactly and efficiently recovered from a few of their canonical projections [17]. So rather than attempting to identify U directly, we will aim to identify small projections of U, knowing that there is a simple way to stitch these projections together to recover U. More precisely, let Ω be a d × N matrix with exactly r + 1 nonzero entries per column, and let ω_i ⊂ {1, 2, . . ., d} index the nonzero entries in the i-th column of Ω. ω_i indicates the canonical coordinates involved in the i-th projection that we will aim to identify. For any subspace, matrix or vector that is compatible with ω_i, we will use the subscript ω_i to denote its restriction to the coordinates/rows in ω_i. For example, M_ωi ∈ R^((r+1)×n) denotes the restriction of M to the rows in ω_i, and U_ωi ⊂ R^(r+1) denotes the projection of U onto the coordinates in ω_i.
Our goal is to identify a collection of projections {U_ωi}, i = 1, . . ., N, such that U can be recovered from these projections. Whether this is the case depends on the ω_i's, i.e., on Ω. Fortunately, Theorem 1 in [17] specifies the conditions on Ω that guarantee that U can be recovered from {U_ωi}. To present this result, let us introduce the matrix A. A1 implies that with probability 1, U_ωi is a hyperplane, i.e., an r-dimensional subspace of R^(r+1). As such, it is characterized by its orthogonal direction, which we will call a_ωi. More precisely, let a_ωi ∈ R^(r+1) be a nonzero vector in the orthogonal complement of U_ωi, and let a_i be the vector in R^d with the entries of a_ωi in the locations of ω_i, and zeros elsewhere. Let A be the d × N matrix formed with {a_i} as columns. This way, A encodes the information of the projections {U_ωi}. With this, we are ready to present Theorem 1 in [17], which we restate here as Lemma 1 with some adaptations to our context.

Lemma 1 (Theorem 1 in [17]). Let A1 hold. With probability 1, U = ker A^T if and only if:

(i) There is a matrix Ω′ formed with d − r columns of Ω, such that every matrix formed with a subset of η columns of Ω′ has at least η + r nonzero rows.
There exist plenty of matrices Ω satisfying (i). One simple choice is

Ω = [ 1 ]
    [ I ]

where 1 denotes an r × (d−r) block of all 1's and I denotes the (d−r) × (d−r) identity, so that the i-th column of Ω is supported on rows {1, . . ., r, r + i}. One may easily verify that this Ω satisfies condition (i) by taking Ω′ = Ω. Notice that A is sparse (it only has r + 1 nonzero entries per column), so computing ker A^T can be done efficiently.
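Lemma 1 is easy to check numerically. The sketch below (our illustrative code, using the ω_i = {1, . . ., r, r + i} choice of Ω described above) builds A from the exact projections of a known subspace and recovers that subspace as ker A^T:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 12, 3
U = rng.standard_normal((d, r))            # basis of a generic r-dim subspace

A = np.zeros((d, d - r))
for i in range(d - r):
    omega = np.r_[np.arange(r), r + i]     # column i of Omega: rows {0..r-1, r+i}
    U_omega = U[omega]                     # (r+1) x r projection of the basis
    u, _, _ = np.linalg.svd(U_omega)       # last left singular vector spans
    A[omega, i] = u[:, -1]                 # the orthogonal direction a_omega

# Lemma 1: U = ker(A^T). Recover a basis from the null space of A^T.
_, _, vt = np.linalg.svd(A.T)
U_hat = vt[-r:].T                          # d x r orthonormal basis of ker(A^T)

# The two subspaces coincide: compare their orthogonal projectors.
P = U @ np.linalg.solve(U.T @ U, U.T)
P_hat = U_hat @ U_hat.T                    # U_hat already has orthonormal columns
print(np.allclose(P, P_hat, atol=1e-8))    # True
```

Each a_i is orthogonal to the corresponding projection of U, so U ⊂ ker A^T; condition (i) guarantees A has rank d − r, so the kernel has dimension exactly r.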
Lemma 1 implies that N = d − r projections are necessary and sufficient to recover U. Furthermore, it tells us which projections to look for, and how to reconstruct U from these projections. Hence, our strategy will be to select a d × (d − r) matrix Ω satisfying (i), then identify the projections U_ω1, . . ., U_ω(d−r), and finally construct A according to these projections to recover U as ker A^T.
In principle, our strategy to identify each projection is very similar to ransac. Given ω_i, we iteratively select r + 1 columns of M_ωi uniformly at random. Let M_ωi ∈ R^((r+1)×(r+1)) denote (with slight abuse of notation) the matrix formed with the selected columns. span{M_ωi} defines a candidate projection of U onto ω_i. A1-A3 imply that with probability 1, span{M_ωi} = U_ωi if and only if M_ωi has no corrupted entries, which will be the case if and only if rank(M_ωi) = r. We will thus verify whether rank(M_ωi) = r. If this is not the case, this candidate projection is discarded, and we try a different M_ωi. On the other hand, if rank(M_ωi) = r, then we know that span{M_ωi} is the projection U_ωi that we were looking for, and we can construct a_i as before. This process is repeated for each column ω_i in Ω to obtain A. Since Ω satisfies (i), we know by Lemma 1 that at the end of this procedure we will have enough projections to reconstruct U as ker A^T.

At this point, we have already recovered U. Let U ∈ R^(d×r) be a basis of U. We will now estimate the coefficient matrix Θ ∈ R^(r×n) such that L = UΘ. Let m be a column in M, and let ω be a subset of {1, 2, . . ., d} with exactly r + 1 elements. A1-A3 imply that with probability 1, m_ω will lie in U_ω if and only if all entries in m_ω are uncorrupted. We will thus iteratively select a set ω indexing r + 1 random entries in m until we find an ω such that m_ω ∈ U_ω. Once we find such an ω, the coefficient vector of the corresponding column in L is given by θ := (U_ω^T U_ω)^(−1) U_ω^T m_ω. This process is repeated for every column in M to obtain the coefficient matrix Θ, which together with U determines L as UΘ. Once L is known, one can trivially recover S through (1). r2pca is summarized in Algorithm 1.
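The full noiseless procedure can be sketched in a few lines. The following is an illustrative implementation under A1-A4 (our own sketch, not the authors' reference code; the structured Ω from the example above, the rank tolerance, and the trial budget are all assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def r2pca_noiseless(M, r, tol=1e-9, max_trials=5000):
    d, n = M.shape
    A = np.zeros((d, d - r))
    for i in range(d - r):
        omega = np.r_[np.arange(r), r + i]      # omega_i: rows {0..r-1, r+i}
        for _ in range(max_trials):
            cols = rng.choice(n, size=r + 1, replace=False)
            block = M[np.ix_(omega, cols)]      # candidate (r+1) x (r+1) block
            if np.linalg.matrix_rank(block, tol=tol) == r:
                u, _, _ = np.linalg.svd(block)  # rank r: block is uncorrupted, so
                A[omega, i] = u[:, -1]          # u[:, -1] spans the orthogonal
                break                           # direction of U_omega
        else:
            raise RuntimeError("no uncorrupted block found for omega_%d" % i)
    _, _, vt = np.linalg.svd(A.T)
    U_hat = vt[-r:].T                           # U = ker(A^T) by Lemma 1

    Theta = np.zeros((r, n))
    for j in range(n):                          # coefficient step, per column
        for _ in range(max_trials):
            omega = rng.choice(d, size=r + 1, replace=False)
            Uo = U_hat[omega]
            theta, *_ = np.linalg.lstsq(Uo, M[omega, j], rcond=None)
            if np.linalg.norm(Uo @ theta - M[omega, j]) < tol:
                Theta[:, j] = theta             # m_omega lies in U_omega: clean
                break
        else:
            raise RuntimeError("no uncorrupted entries in column %d" % j)
    L_hat = U_hat @ Theta
    return L_hat, M - L_hat                     # estimates of L and S via (1)

# Generic rank-2 L plus sparse gross errors.
d, n, r = 30, 30, 2
L = rng.standard_normal((d, r)) @ rng.standard_normal((r, n))
S = np.zeros((d, n))
S.flat[rng.choice(d * n, 12, replace=False)] = 10.0
L_hat, S_hat = r2pca_noiseless(L + S, r)
print(np.allclose(L_hat, L, atol=1e-6))
```

Note that the rank test is exactly the consensus check of the main text: an uncorrupted (r + 1) × (r + 1) block is rank r almost surely, while a corrupted one is full rank almost surely.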

More about our Assumptions
Essentially, A1-A3 require that M is a combination of a generic sparse matrix and a generic low-rank matrix. This discards pathological cases, like matrices with identical columns or exact-zero entries. Examples of these cases could arise in unnatural, cartoon-like images.
However, A1-A3 allow realistic cases, like natural images. For instance, backgrounds in natural images can be highly structured but are not perfectly constant, as there is always some degree of natural variation that is reasonably modeled by an absolutely continuous (but possibly highly inhomogeneous) distribution. For example, the sky in a natural image might be strongly biased towards blue values, but each sky pixel will have at least small variations that make the sky not perfectly constant blue. So while these are structured images, these variations make them generic enough for our theoretical results to apply. This is confirmed in our real-data experiments.
Furthermore, because absolutely continuous distributions may be strongly inhomogeneous, they can be used to represent highly coherent matrices (that is, matrices whose underlying subspace is highly aligned with the canonical axes). Previous theory and methods for RPCA cannot handle some of the highly coherent cases that our new theory covers and that our new algorithm handles well (as demonstrated in our experiments).
We point out that A1-A3 do not imply coherence, nor vice versa. For example, coherence assumptions indeed allow some identical columns, or exact-zero entries. However, they rule out cases that our theory allows. For example, consider a case where a few rows of U are drawn i.i.d. N(0, σ_1^2) and many rows of U are drawn i.i.d. N(0, σ_2^2), with σ_1 ≫ σ_2. This is a good model for some microscopy and astronomical applications that have a few high-intensity pixels and many low-intensity pixels. Such a U would yield a highly coherent matrix, which existing theory and algorithms cannot handle, while our results can (as confirmed in our experiments).
To sum up, our assumptions are different from, not stronger nor weaker than, the usual coherence assumptions, and we believe they are also more reasonable in many practical applications.

Handling Noise
In practice, all entries in M may be noisy, even the ones that are not corrupted by gross errors. We can model this as

M = L + S + W, (2)

where L and S are as before, and W represents a noise matrix. The goal is the same as before: determine L and S from M.
Recall that r2pca's goal is to identify the projections U_ω1, . . ., U_ω(d−r) to then reconstruct U. In the noiseless setting, we do this by iteratively selecting (r + 1) × (r + 1) matrices M_ωi and checking their rank. If rank(M_ωi) = r, then we know that all entries in M_ωi are uncorrupted, whence U_ωi is given by span{M_ωi}. But in the noisy setting, rank(M_ωi) = r + 1 in general, regardless of whether these columns are corrupted by gross errors. Hence we cannot determine whether the columns in M_ωi are uncorrupted by simply checking whether rank(M_ωi) = r, as we did in the noiseless setting. Instead, we can check the (r + 1)-th singular value of M_ωi, which we denote λ_(r+1). If λ_(r+1) is above the noise level, it is likely that at least one entry in M_ωi is grossly corrupted. On the other hand, if λ_(r+1) is within the noise level, it is likely that M_ωi has no grossly corrupted entries, whence we can use the subspace spanned by the r leading singular vectors of M_ωi as an estimator of U_ωi. Unfortunately, since M_ωi only has r + 1 rows and columns, the singular values and vectors of M_ωi will have a large variance. This means that λ_(r+1) will be above the noise level for many uncorrupted matrices M_ωi, and below the noise level for many corrupted ones. As a result, we could miss good estimators and use bad ones. Furthermore, even if M_ωi is uncorrupted, the subspace spanned by its r leading singular vectors could be far from U_ωi. As a result, our estimate of U could be very inaccurate.
But this can be improved if we use a few more entries of M so that the noise averages out. To this end, let κ_i be a subset of {1, 2, . . ., d} with k > r elements containing ω_i, and let M_κi ∈ R^(k×k) be a matrix formed with k columns of M restricted to the rows in κ_i. Define V_κi ∈ R^(k×r) as the matrix formed with the r leading left singular vectors of M_κi. Under mild, typical assumptions (e.g., finite second and fourth moments), if M_κi is uncorrupted, then as k grows the (r + 1)-th singular value of M_κi converges to the noise level, and span{V_κi} converges to U_κi. In other words, the larger k, the better the estimates of U we will obtain. On the other hand, as k grows it is more likely that at least one entry in M_κi is grossly corrupted (because M_κi has more entries, each of which may be grossly corrupted), contrary to what we want. We thus want k to be large enough that M_κi can be used to accurately estimate U_κi, but not so large that there are no matrices M_κi with uncorrupted entries. The fraction of corrupted entries in M determines how large k can be. Figure 3 in Section 6 shows the feasible range of k as a function of the fraction of corrupted entries in M.
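A minimal version of this singular-value test looks as follows (our sketch; the paper does not specify a threshold, so the cutoff c·σ·√k is an assumption motivated by the fact that the largest singular value of a k × k Gaussian noise matrix concentrates around 2σ√k):

```python
import numpy as np

rng = np.random.default_rng(0)

def block_is_clean(M_kappa, r, sigma, c=4.0):
    # Accept a k x k block as uncorrupted when its (r+1)-th singular
    # value is within the noise level. The threshold c*sigma*sqrt(k)
    # is a heuristic assumption, not taken from the paper.
    s = np.linalg.svd(M_kappa, compute_uv=False)
    return bool(s[r] <= c * sigma * np.sqrt(M_kappa.shape[0]))

k, r, sigma = 12, 3, 1e-3
clean = (rng.standard_normal((k, r)) @ rng.standard_normal((r, k))
         + sigma * rng.standard_normal((k, k)))
corrupted = clean.copy()
corrupted[4, 7] += 10.0                    # one gross error

print(block_is_clean(clean, r, sigma),     # True: residual within noise level
      block_is_clean(corrupted, r, sigma)) # False: the spike inflates s[r]
```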
Since M_κi has k > r rows, if M_κi is believed to be uncorrupted, we can use it to estimate several projections of U (as many as k − r). To see this, let υ_i be a subset of ω_i with exactly r elements. Let j ∈ κ_i \ υ_i, and let ω_ij := υ_i ∪ {j}. Observe that V_ωij ∈ R^((r+1)×r) gives an estimate of U_ωij through span{V_ωij}. As before, we will store this information in the matrix A. More precisely, for each j ∈ κ_i \ υ_i, we will take a nonzero vector a_ωij ∈ ker V_ωij^T, and construct the vector a_ij ∈ R^d with the entries of a_ωij in the locations of ω_ij, and zeros elsewhere. This time, A will be the matrix with the a_ij's as columns. Since ω_i = ω_ij for some j, Lemma 1 suggests that the projections encoded in A should be enough to reconstruct U. We can thus use the matrix Û ∈ R^(d×r) formed with the last r left singular vectors of A (which approximate a basis of ker A^T) to estimate U.
Similarly, in the second part of r2pca we can estimate the coefficients of L using k entries of each column in M. More precisely, for each column m in M, we iteratively select a set κ indexing k random entries in m until we find a κ such that m_κ is close to span{Û_κ} (within the noise level). If this is the case, it is likely that the entries in m_κ are uncorrupted, and that θ = (Û_κ^T Û_κ)^(−1) Û_κ^T m_κ is a good estimate of the coefficient vector we are looking for. We repeat this process to obtain an estimate Θ̂ of Θ. Finally, our estimate of L is given by ÛΘ̂, which in turn gives an estimate of S through (1). The noisy variant of r2pca is summarized in Algorithm 2 in Appendix B.
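The coefficient step of the noisy variant admits a similar sketch (again our own illustration; the residual threshold and the assumption that the subspace estimate Û is already available are ours):

```python
import numpy as np

rng = np.random.default_rng(1)

def robust_coeff(m, U_hat, k, sigma, c=4.0, trials=500):
    # Select k random entries of column m until their residual against
    # span(U_hat), restricted to those entries, is within the noise level.
    # The threshold c*sigma*sqrt(k) is a heuristic assumption.
    d, r = U_hat.shape
    for _ in range(trials):
        kappa = rng.choice(d, size=k, replace=False)
        Uk = U_hat[kappa]                              # k x r
        theta, *_ = np.linalg.lstsq(Uk, m[kappa], rcond=None)
        if np.linalg.norm(Uk @ theta - m[kappa]) <= c * sigma * np.sqrt(k):
            return theta                               # entries likely uncorrupted
    raise RuntimeError("no uncorrupted entry set found")

d, r, k, sigma = 40, 3, 10, 1e-3
U_hat, _ = np.linalg.qr(rng.standard_normal((d, r)))   # stand-in for the estimated basis
theta_true = rng.standard_normal(r)
m = U_hat @ theta_true + sigma * rng.standard_normal(d)
m[[2, 17]] += 5.0                                      # two grossly corrupted entries
theta = robust_coeff(m, U_hat, k, sigma)
print(np.linalg.norm(theta - theta_true))              # small: within the noise level
```

Entry sets containing a gross error produce a residual on the order of the error itself and are rejected, so the accepted θ is computed from clean entries only.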

Experiments
In this section we present a series of experiments to study the performance of r2pca and compare it with the augmented Lagrange multiplier method for robust PCA (RPCA-ALM) [18,19]. We found, consistent with previous reports [9,16,20], that the RPCA-ALM algorithm typically performed as well as or better than several others (e.g., singular value thresholding [7], alternating direction method [21], accelerated proximal gradient [22] and dual method [22]).

Synthetic Data.
We will first use simulations to study the performance of r2pca in the noiseless setting, as a function of the percentage of grossly corrupted entries per row p, and the coherence of L, defined as

µ := (d/r) max_{1≤i≤d} ‖P_U e_i‖_2^2,

where P_U denotes the projection operator onto U, and e_i the i-th canonical vector in R^d. Intuitively, µ parametrizes how aligned U is with respect to the canonical axes. In all our experiments, L was a d × n, rank-r matrix, with d = n = 100 and r = 5.
In our simulations, we first generated a d × r random matrix U with N(0, 1) i.i.d. entries to use as a basis of U. To obtain matrices with a specific coherence parameter, we simply increased the magnitude of a few entries in U until it had the desired coherence. We then generated an r × (r + 1)(d − r) random matrix Θ, also with N(0, 1) i.i.d. entries, to use as coefficient vectors. With this, we constructed L = UΘ. Next, we generated a d × n matrix S with p percent of nonzero entries per row, selected uniformly at random. The nonzero entries in S are i.i.d. N(0, 10). Finally, we obtained M as in (1). We repeated this experiment 100 times for each pair (p, µ), and recorded the fraction of trials in which L was exactly recovered. We declared a success if the normalized error (using the Frobenius norm) was below 10^−10 after at most 10^3 seconds. The results are summarized in Figure 2. As predicted by our theory, r2pca performs perfectly as long as S is sufficiently sparse, regardless of coherence. In contrast, other approaches rely on low coherence (e.g., [5][6][7][8][9][10][11][12]), and one can see that their performance quickly decays as coherence increases. On the other hand, as p grows and S becomes less sparse, the likelihood of finding uncorrupted blocks in M quickly decays. In turn, it takes more time to identify projections of U, up to the point where r2pca is unable to identify enough projections to reconstruct U. Our astronomy and real-data experiments below illustrate the importance of coherent matrices and non-uniformly distributed errors.
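The coherence manipulation used above is simple to reproduce. A short sketch (our code; `coherence` is a hypothetical helper implementing µ = (d/r) max_i ‖P_U e_i‖² via leverage scores):

```python
import numpy as np

rng = np.random.default_rng(0)

def coherence(U):
    # mu = (d/r) * max_i ||P_U e_i||^2, computed from an orthonormal
    # basis of span(U); mu always lies in [1, d/r].
    d, r = U.shape
    Q, _ = np.linalg.qr(U)
    leverage = np.sum(Q ** 2, axis=1)      # ||P_U e_i||^2 for each row i
    return (d / r) * leverage.max()

d, r = 100, 5
U = rng.standard_normal((d, r))            # generic basis: moderate coherence
mu_low = coherence(U)
U[:3] *= 40                                # boost a few rows, as in the text
mu_high = coherence(U)
print(round(mu_low, 1), round(mu_high, 1)) # mu_high approaches d/r = 20
```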
In Section 5 we presented a noisy variant of r2pca that iteratively selects k rows of k columns of M to estimate U and Θ. Its performance depends on the choice of k. If k is too small, our estimates could be very inaccurate. If k is too large, M may not contain enough k × k uncorrupted blocks to obtain an estimate. The feasible range of k depends on the percentage of corrupted entries p. In our next experiment we study the performance of the noisy variant of r2pca as a function of p and k. To obtain a noisy M according to (2), we generated matrices L and S as described before, and then added a d × n random matrix W with N(0, σ^2) i.i.d. entries. To measure accuracy we recorded the error of the estimated L after at most 10^3 seconds (using the normalized Frobenius norm). We repeated this experiment 100 times for each pair (p, k), with σ = 10^−3 fixed. The results, summarized in Figure 3, show the feasible range of k.
In our next simulation, we selected k = 2r, known from our previous experiment to produce reasonable results for a wide range of p, and used it to test the performance of r2pca as a function of noise and coherence, with fixed p = 5%. We repeated this experiment 100 times for each pair (σ, µ). The results, summarized in Figure 4, show that r2pca can consistently estimate L within the noise level, as long as S is sufficiently sparse, regardless of coherence (as opposed to other algorithms).
Astronomy and Correlated Errors. In a video, the background can be modeled as approximately low-rank, and the foreground objects (like cars or people) can be modeled as sparse errors (as they typically occupy only a fraction of each frame). So the sparse-plus-low-rank model is a natural fit for this problem. Here M is the matrix containing the vectorized images of the video, and the goal is to decompose it into the sparse foreground S and the low-rank background L. We now present an experiment where highly coherent matrices and highly correlated sparse errors arise in a very natural way: background segmentation of astronomy videos.
In this experiment we simulated astronomy videos with a background formed by ν twinkling stars and ρ moving objects (see Figure 5). For each of the ρ moving objects, we selected uniformly at random: one starting point on the edge of a 90 × 120 frame, one starting time in {1, 2, . . ., 100}, one direction, and one speed ranging from 1 to 5 pixels per frame. With this information, we created ρ objects moving across a dark background over 100 frames. Each moving object consisted of an r × r block with N(0, 1) entries. We vectorized the frames to obtain a 10800 × 100 matrix, whose entries we normalized between 0 and 1 to obtain S. Finally, we replaced the zero entries in S with the corresponding entries in L to obtain M. This way, all entries in M are between 0 and 1, such that the brightest star shines at a maximum intensity of 1, and so does the brightest pixel of all moving objects.
We repeated this experiment 100 times for each pair (ν, ρ). The results, summarized in Figure 6, show that r2pca has almost perfect performance handling highly coherent matrices and highly correlated errors (as opposed to other algorithms).
Real Data: Microscopy and Surveillance. Finally, we evaluate the background segmentation performance of r2pca on real data. To this end we used several microscopy videos from the Internet [25], and videos from two widely used datasets: the Wallflower dataset [23] and the I2R dataset [24]. Figure 7 shows several examples, with more in Appendix C.
We point out that many cases in the Wallflower and I2R datasets have low coherence. In these cases, the performance of r2pca and RPCA-ALM is very similar. Consistent with our theory, the advantage of r2pca becomes more evident in highly coherent cases, like our microscopy and astronomy experiments.
Remark 2. Notice that in all of our background experiments, r2pca can handle a much larger fraction of gross errors than allowed by Theorem 1. This is because Theorem 1 holds even under the worst-case scenario where the errors are purposely located to complicate success. In many applications, as in background segmentation, errors are often grouped, which tends to leave more uncorrupted blocks. This facilitates r2pca's success.

Conclusions
In this paper we present r2pca, a novel algorithm for robust PCA. We show that under reasonable assumptions, r2pca will succeed with probability 1 in linear time, without assumptions on coherence or the distribution of the sparse errors. The algorithm is parallelizable and can be used in large-scale settings where the dataset is too large to even store in memory. Our experiments show that r2pca consistently outperforms state-of-the-art methods both in terms of speed and accuracy in a broad range of settings, particularly in high-coherence cases.

Figure 2 :
Figure 2: Transition diagrams of the success rate (top row) and time (bottom row) for exact recovery of L as a function of the percentage of grossly corrupted entries per row p, and the coherence parameter µ ∈ [1, d/r]. The color of each (p, µ) pixel indicates the average over 100 trials (the lighter the better). Notice that as p grows, so does the time required to find projections, up to the point where r2pca is unable to find enough projections to reconstruct U. Theorem 1 shows that if M has at most p = 7.9% corrupted entries per row (dashed line), then r2pca can exactly recover L. We point out that our results hold regardless of coherence, as opposed to other algorithms, whose performance quickly decays as coherence increases.

Figure 3 :
Figure 3: Transition diagram of the estimation error of L (using the noisy variant of r2pca) as a function of the percentage of grossly corrupted entries per row p and the parameter k, with noise level σ = 10^−3. The color of each (p, k) pixel indicates the average over 100 trials (the lighter the better). If k is too small, our estimate could be very inaccurate. If k is too large, it is less likely to find k × k uncorrupted matrices to obtain an estimate. This figure shows the feasible range of k (white region, where r2pca can recover L within the noise level), which depends on the percentage of corrupted entries p.

Figure 4 :
Figure 4: Transition diagram of the estimation error of L as a function of the noise level σ and the coherence parameter µ, with p = 5% grossly corrupted entries. The color of each (σ, µ) pixel indicates the average error over 100 trials (the lighter the better). This shows that r2pca can consistently estimate L within the noise level, as long as S is sufficiently sparse, regardless of coherence. Other algorithms can also estimate L within the noise level, but only for a restricted range of matrices with bounded coherence.

Figure 5 :
Figure 5: Left: One frame of a simulated astronomy video, composed of a background formed by ν twinkling stars and ρ moving objects. Each object (block) moves in a random direction at a random speed over the video. Right: Each frame is vectorized to form the matrix M, which is shown negated and transposed for display purposes (i.e., we see 1 − M^T). The vertical lines correspond to the pixels of the stars. All other points correspond to the moving objects. These points are highly correlated, as is the location of an object in consecutive frames.
To this end we first generated a d × r matrix U, and an r × N matrix Θ, with d = 90 · 120 = 10800, r = 5 and N = 100. We selected ν rows of U uniformly at random, and populated them with the absolute values of i.i.d. N(0, 100) random variables. These entries represent the twinkling stars. All other entries in U were populated with the absolute values of i.i.d. N(0, 1) random variables. Similarly, we populated Θ with the absolute values of i.i.d. N(0, 1) random variables. Next we constructed L = UΘ, and bounded its entries by 1 (i.e., we divided L by its maximum value). Each column of L represents the vectorized background of a 90 × 120 frame of a video.

Figure 6 :
Figure 6: Transition diagrams of the success rate (top row) and time (bottom row) for exact recovery of L as a function of the number of moving objects ρ and the number of twinkling stars ν for the experiment described in Figure 5. The larger ρ, the larger the proportion of corrupted entries p. The larger ν, the lower the coherence µ. The color of each (ρ, ν) pixel indicates the average over 100 trials (the lighter the better). The results are consistent with the experiments in Figure 2.

Figure 7 :
Figure 7: Sparse (foreground) plus low-rank (background) decomposition of several microscopy videos [25] using r2pca and RPCA-ALM [18,19]. Notice that the background obtained by RPCA-ALM contains foreground objects, while the background obtained by r2pca is much cleaner. This is because in these videos the background is mostly dark with a few bright regions (which implies a highly coherent subspace) and the location of the errors is highly correlated (the location of an object in consecutive frames is very similar). In contrast to other optimization methods [5-12, 18, 19], we make no assumptions about coherence or the distribution of the sparse errors, so this does not affect our results.