Eﬃcient block boundaries estimation in block-wise constant matrices: An application to HiC data

: In this paper, we propose a novel modeling and a new method- ology for estimating the location of block boundaries in a random matrix consisting of a block-wise constant matrix corrupted with white noise. Our method consists in rewriting this problem as a variable selection issue. A penalized least-squares criterion with an (cid:2) 1 -type penalty is used for dealing with this problem. Firstly, some theoretical results ensuring the consistency of our block boundaries estimators are provided. Secondly, we explain how to implement our approach in a very eﬃcient way. This implementation is available in the R package blockseg which can be found in the Comprehen- sive R Archive Network. Thirdly, we provide some numerical experiments to illustrate the statistical and numerical performance of our package, as well as a thorough comparison with existing methods. Fourthly, an empirical procedure is proposed for estimating the number of blocks. Finally, our approach is applied to HiC data which are used in molecular biology for better understanding the inﬂuence of the chromosomal conformation on the cells functioning.


Introduction
Detecting automatically the block boundaries in large block wise constant matrices corrupted with noise is a very important issue which may have several applications. One of the main situations in which this problem occurs is in the study of HiC data. It corresponds to one of the most recent chromosome conformation capture technologies that have been developed to better understand the influence of the chromosomal conformation on the cells functioning. This technology is based on a deep sequencing approach and provides read pairs corresponding to pairs of genomic loci that physically interacts in the nucleus, see [13] for more details. The raw measurements provided by HiC data are often summarized as a square matrix where each entry at row i and column j stands for the total number of read pairs matching in position i and position j, respectively, see [5] for further details. Positions refer here to a sequence of non-overlapping windows of equal sizes covering the genome.
Blocks of different intensities arise among this matrix, revealing interacting genomic regions among which some have already been confirmed to host coregulated genes. The purpose of the statistical analysis is then to provide a fully automated and efficient strategy to determine a decomposition of the matrix in non-overlapping blocks, which gives, as a by-product, a list of non-overlapping interacting chromosomic regions. In the following, our goal will thus be to design an efficient and fully automated method to find the block boundaries of nonoverlapping blocks in very large matrices which can be modeled as block wise constant matrices corrupted with white noise. As a natural extension to the twodimensional case, the positions in columns and rows of the block boundaries within the observation matrix will also be called change-points. For a more precise definition, we refer the reader to the beginning of Section 2.
An abundant literature is dedicated to the change-point detection issue for one-dimensional data both from a theoretical and practical point of view. From a practical point of view, the standard approach for estimating the change-point locations is based on least-square fitting, performed via a dynamic programming algorithm (DP). Indeed, for a given number of change-points K, the dynamic programming algorithm, proposed by [3] and [7], takes advantage of the intrinsic additive nature of the least-square objective to recursively compute the optimal change-points locations with a complexity of O(Kn 2 ) in time, see [1] and [11]. This complexity has recently been improved by [20] and [15] in some specific cases. Another very popular approach in the one-dimensional case is the Binary Segmentation method proposed by [22]. However, contrary to the DP approach, it does not necessary provide the optimal solution of the least-square minimization problem.
However, in general one-dimensional situations, the computational burden of the DP based methods is prohibitive to handle very large data sets. In this situation, [9] proposed to rephrase the change-point estimation issue as a variable selection problem. This approach has also been extended by [24] to find shared change-points between several signals. In the two-dimensional case, namely when matrices have to be processed, no method has been proposed, to the best of our knowledge, for providing the block boundaries of non overlapping blocks of very large n × n matrices. In order to be able to process observation matrices coming from HiC experiments, we aim at being able to handle 5000 × 5000 matrices, which corresponds to matrices having 2.5 × 10 7 entries. The only statistical approach proposed for retrieving such non-overlapping block boundaries in this two-dimensional framework is the one devised by [12] but it is limited to the case where the block wise matrix is assumed to be block wise constant on the diagonal and constant outside the diagonal blocks.
The difficulties that we have to face with in the two-dimensional framework are the following. Firstly, it has to be noticed that the classical dynamic programming algorithm cannot be applied in such a framework since the Markov property does not hold anymore. Secondly, the group-lars approach of [24] cannot be used in this framework since it would only provide change-points in columns and not in rows. Thirdly, although very efficient for image denoising, neither the generalized Lasso approach devised by [23] nor the fused Lasso signal approximator of [10], which are implemented in the R packages genlasso and flsa, respectively, give access to the boundaries of non-overlapping blocks of a noisy block wise constant matrix. This fact is illustrated in Figure 2. The first column of this figure contains the block wise constant matrix of Figure 1 corrupted with additional noise in high signal to noise ratio contexts. The denoising of these noisy matrices obtained by the packages genlasso and flsa is displayed in the second and third columns of Figure 1, respectively. Note that, for obtaining these results, we used the default parameters of these packages and for the parameter λ we used the one giving the denoised matrix being the closest to the original one in terms of recovered blocks.
In this paper, our goal is thus to design a statistical method for estimating the location of the boundaries of non-overlapping blocks from a block wise constant matrix corrupted with white noise. To the best of our knowledge, there is indeed no statistical procedure for answering this specific question in the literature that is both computationally and statistically efficient.
The paper is organized as follows. In Section 2, we first describe how to rephrase the problem of two-dimensional change-point estimation as a high di-  mensional sparse linear model and give some theoretical results which prove the consistency of our change-point estimators. In Section 3, we describe how to efficiently implement our method. Then, we provide in Section 4 experimental evidence of the relevance of our approach on synthetic data. We conclude in Section 6 by a thorough analysis of a HiC dataset.

Statistical modeling
In this section, we explain how the two-dimensional retrospective change-point estimation issue can be seen as a variable selection problem. Our goal is to estimate t 1 = (t 1,1 , . . . , t 1,K 1 ) and t 2 = (t 2,1 , . . . , t 2,K 2 ) from the random matrix with the convention t 1,0 = t 2,0 = 1 and t 1,K 1 +1 = t 2,K 2 +1 = n + 1. An example of such a matrix U is displayed in Figure 3. The entries E i,j of the matrix E = (E i,j ) 1≤i,j≤n are iid zero-mean random variables. With such a definition the Y i,j are assumed to be independent random variables with a blockwise constant mean. Let T be a n × n lower triangular matrix with nonzero elements equal to one and B a sparse matrix containing null entries except for the B i,j such that (i, j) ∈ {t 1,0 , . . . , t 1,K 1 } × {t 2,0 , . . . , t 2,K 2 }. Then, (2.1) can be rewritten as follows: where T denotes the transpose of the matrix T. For an example of a matrix B, see Figure 3. Let Vec(X) denotes the vectorization of the matrix X formed by stacking the columns of X into a single column vector then Vec(Y) = Vec(TBT ) + Vec(E). Hence, by using that Vec(AXC) = (C ⊗ A)Vec(X), where ⊗ denotes the Kronecker product, (2.2) can be rewritten as: where Y = Vec(Y), X = T ⊗ T, B = Vec(B) and E = Vec(E). Thanks to these transformations, Model (2.1) has thus been rephrased as a sparse high dimensional linear model where Y and E are n 2 × 1 column vectors, X is a n 2 × n 2 matrix and B is n 2 × 1 sparse column vectors. Multiple change-point estimation Problem (2.1) can thus be addressed as a variable selection problem: where u 2 2 and u 1 are defined for a vector u in R N by is related to the popular Least Absolute Shrinkage and Selection Operator (LASSO) in least-square regression. Thanks to the sparsity enforcing property of the 1 -norm, the estimator B of B is expected to be sparse and to have non-zero elements matching with those of B. Hence, retrieving the positions of the non zero elements of B thus provides estimators of (t 1,k ) 1≤k≤K 1 and of (t 2,k ) 1≤k≤K 2 . More precisely, let us define by A(λ n ) the set of active variables: For each j in A(λ n ), consider the Euclidean division of (j − 1) by n, namely (j − 1) = nq j + r j then where t 1,1 < t 1,2 < · · · < t 1,| A1(λn)| , t 2,1 < t 2,2 < · · · < t 2,| A2(λn)| . (2.5) In (2.5), | A 1 (λ n )| and | A 2 (λ n )| correspond to the number of distinct elements in {r j : j ∈ A(λ n )} and {q j : j ∈ A(λ n )}, respectively.
As far as we know, neither thorough practical implementation nor theoretical grounding have been given so far to support such an approach for change-point estimation in the two-dimensional case. In the following section, we give theoretical results supporting the use of such an approach.

Theoretical results
In order to establish the consistency of the estimators t 1 and t 2 defined in (2.5), we shall use assumptions (A1-A4). These assumptions involve the two following quantities which corresponds to the smallest length between two consecutive change-points and to the smallest jump size between two consecutive blocks, respectively.
(A1) The random variables (E i,j ) 1≤i,j≤n are iid zero mean random variables such that there exists a positive constant β such that for all ν in R, E[exp(νE 1,1 )] ≤ exp(βν 2 ).
(A2) The sequence (δ n ) is a non increasing and positive sequence tending to zero such that nδ n J min 2 / log(n) → ∞, as n tends to infinity.
The following proposition ensures that the distance between each estimated and true change-point is less than nδ n , where δ n is a non increasing and positive sequence tending to zero defined in (A2), (A3) and (A4), with a probability tending to 1, as n → ∞.
1≤i,j≤n be defined by (2.1) and t 1,k , t 2,k be defined by (2.5). Assume that (A1)-(A4) hold. Assume also that | A 1 (λ n )| = K 1 and that | A 2 (λ n )| = K 2 , with probabilty tending to one. Then, The proof of Proposition 1 is based on the two following lemmas. The first one comes from the Karush-Kuhn-Tucker conditions of the optimization problem stated in (2.4). The second one allows us to control the supremum of the empirical mean of the noise.

Lemma 2.
Let (Y i,j ) 1≤i,j≤n be defined by (2.1). Then, U = X B, where X and B are defined in (2.3) and (2.4) respectively, is such that where q j and r j are the quotient and the remainder of the Euclidean division of (j − 1) by n, respectively, that is (j − 1) = nq j + r j . In (2.7), sign denotes the function which is defined by sign( Moreover, the matrix U, which is such that U = Vec( U), is blockwise constant and satisfies (2.5).

j≤n be random variables satisfying (A1). Let also (v n )
and (x n ) be two positive sequences such that v n x 2 n / log(n) → ∞, then The proofs of Proposition 1, Lemmas 2 and 3 are given in Appendix A. Remark. If Y is a non square matrix having n 1 rows and n 2 columns, with n 1 = n 2 , the result of Proposition 1 remains valid if in Assumption (A2) δ n is replaced by δ n1,n2 satisfying n 1 δ n1,n2 J min 2 / log(n 2 ) → ∞ and n 2 δ n1,n2 J min 2 / log(n 1 ) → ∞, as n 1 and n 2 tend to infinity.

Implementation
In order to identify a series of change-points we look for the whole path of solutions in (2.4), i.e., {B(λ), λ min < λ < λ max } such that |Â(λ max )| = 0 and |Â(λ min )| = s with s a predefined maximal number of activated variables. To this end it is natural to adopt the famous homotopy/LARS strategy of [18,6]. Such an algorithm identifies in Problem (2.4) the successive values of λ that correspond to the activation of a new variable, or the deletion of one that became irrelevant. However, the existing implementations do not apply here since the size of the design matrix X -even for reasonable n -is challenging both in terms of memory requirement and computational burden. To overcome these limitations, we need to take advantage of the particular structure of the problem. In the following lemmas (which are proved in Appendix A), we show that the most involving computations in the LARS can be made extremely efficiently thanks to the particular structure of X .

Lemma 4.
For any vector v ∈ R n 2 , computing X v and X v requires at worse 2n 2 operations. Lemma 5. Let A = {a 1 , . . . , a K } and for each j in A let us consider the Euclidean division of j − 1 by n given by j − 1 = nq j + r j , then Remark. We were able to obtain a closed-form expression of the inverse (X A X A ) −1 for some special cases of the subset A, namely, when the quotients/ratios associated with the Euclidean divisions of the elements of A are endowed with a particular ordering. Moreover, for addressing any general problem, we rather solve systems involving X A X A by means of a Cholesky factorization which is updated along the homotopy algorithm. These updates correspond to adding or removing an element at a time in A and are performed efficiently as stated in Lemma 6. These lemmas are the building blocks for our LARS implementation given in Algorithm 1, where we detail the leading complexity associated with each part. The global complexity is in O(mn 2 + ms 2 ) where m is the final number of steps in the while loop. These steps include all the successive additions and deletions needed to reach s, the final targeted number of active variables. At the end of day, we have m block wise predictionŶ associated with the series of m estimations ofB(λ). The above complexity should be compared with the usual complexity of the LARS algorithm, when no particular structure is at play in Problem (2.4): in such a case, a implementation of the LARS as in [2] would be at least in O(mn 4 + ms 2 ).

// Compute the direction step
Find the maximal step preserving equicorrelation Find the maximal step preserving the signs γout ← min In all our experiments the number of steps m required to reach a given number of change-points s was always of the same order as s, as argued in the original LARS paper of [6]. However, there is no theoretical guarantee for m to be small: Indeed, it is possible to build artificial designs where m is equal to s 3 , see [16]. Mairal and Yu in [16] propose a slight modification of the LARS to overcome this issue, that can be applied to our implementation.
Concerning the memory requirements, we only need to store the n × n data matrix Y once. Indeed, since we have at our disposal the analytic form of any sub matrix extracted from X X , we never need to compute neither store this large n 2 × n 2 matrix. This paves the way for quickly processing data with thousands of rows and columns.

Simulation study
In this Section, we conduct a set of simulation studies to assess the performances of our proposal. First, we report the computational performances of Algorithm 1 and of its practical implementation in terms of timings. Second, we report the statistical performances of our estimators (2.5) for recovering the true changepoints by means of Receiver Operating Characteristic (ROC) curves.

Data generation
All synthetic data are generated from Model (2.1). We control the computational difficulty of the problem by varying the sample size n. The statistical difficulty is controlled by varying σ, the standard deviation of the Gaussian noise E. We chose different patterns for the true matrix U designed to mimic the variety of block matrix structures met in Hi-C data. These patterns are obtained by changing the parameters μ k, s, each of whom controlling the intensity in block (k, ) of U . We consider four different scenarios, all with K 1 = 4 change-points along the rows and K 2 = 4 change-points along the columns.
(4.1) Examples of matrices Y are displayed in Figure 4 for these four scenarios, with n = 100 and σ = 1 which corresponds to a relatively small level of noise in this problem. The first (μ , (1) k, ) corresponds to a "checkerboard-shaped" matrix, that is, a natural two dimensional extension of a one dimensional piece-wise constant problem.
The second (μ , (2) k, ) defines a block diagonal model that mimics the cisinteractions in the human Hi-C experiments: these are the most usual interactions found in the cell, which occur between nearby elements along the genome.
The third (μ , k, ) and fourth (μ , (4) k, ) configurations describe more complex patterns that can be found when trans-interactions occur in Hi-C experiments. They also correspond to more difficult change-points problems.

Competitors and implementation details
In our experiments, we compare our methodology with popular methods for segmentation and variable selection that we adapted to the specific problem of two-dimensional change-points detection: 1. First, we adapt Breiman et al.'s classification and regression trees [4] (hereafter called CART) by using the successive boundaries provided by CART as change-points for the two-dimensional data. We use the implementation provided by the publicly available R package rpart. 2. Second, we adapt Harchaoui and Lévy-Leduc's method [9] (hereafter HL), which is the exact one-dimensional counterpart of our approach. To analyse two-dimensional data, we apply this procedure to each row of Y in order to recover the change-points of each row. The change-points appearing in the different rows are claimed to be change-points for the twodimensional data either if they appear at least in one row (variant HL1) or if they appear in ([n/2] + 1) rows (variant HL2). This approach is fitted by solving n Lasso problems (one per row of Y) by means of the R package glmnet. 3. Third, we consider an adaptation of the fused-Lasso (hereafter FL2D). Indeed, as illustrated in the introduction, the basic 2-dimensional fused-Lasso for signal approximator is not tailored for recovering change points. We thus consider the following variant, which applied a fused-Lasso penalty on the following linear model: where 1 n (resp. 0 n ) is a size-n column vector of ones (resp. zeros), I n a n × n-diagonal matrix of ones and Y, E are defined as in Equation (2.3).
The FL2D method detects a change-point in columns (resp. in row) if two successive values β To solve this problem, we must fit a general fused-Lasso problem. We rely on the R package genlasso for this task. 4. Finally, our own procedure, that we call blockseg, is implemented in the R package blockseg which is available from the Comprehensive R Archive Network (CRAN, [19]). Most of the computation are performed in C++ using the library armadillo for linear algebra [21].
In what follows, all experiments were conducted on a Linux workstation with Intel Xeon 2.4 GHz processor and 8 GB of memory.

Numerical performances
We start by presenting in Figure 5 the computational times for 100 runs of each method applied to a matrix drawn from the "checkerboard" scenario, with n = 100 and σ = 5. Each run provides the estimated change-points in rows and columns for all the possible number of change-points in rows and in columns, that is all the values between 1 and n.
Independent of its statistical performance, we can see on this small problem that the adaptation of the fused-Lasso cannot be used for analyzing real Hi-C problems. On the other hand, our modified CART procedure is extremely fast. However, we will see that its statistical performances are quite poor. Finally, our implementation blockseg is quite efficient as it clearly outperforms HL. This should be emphasized since blockseg is a two-dimensional method dealing with data with size n 2 , while HL is a 1-dimensional approach that addresses two univariate problems of size n.
We now consider blockseg on its own in order to study the scalability of our approach regarding the problem dimension. To this end, we generated "checkerboard" matrix μ ,(1) k, given in (4.1) with various sizes n (from 100 to 5000) and various values of the maximal number of activated variables s (from 50 to 750). The median runtimes obtained from 4 replications (+ 2 for warm-up)  are reported in Figures 6. The left (resp. the right) panel gives the runtimes in seconds as a function of s (resp. of n). These results give experimental evidence for the theoretical complexity O(mn 2 + ms 2 ) that we established in Section 3 and thus for the computational efficiency of our approach: applying blockseg to matrices containing 10 7 entries takes less than 2 minutes for s = 750.

Statistical performances
We evaluate the performance of the different competitors for recovering the true change-points in the 4 scenarios defined in Section 4.1 for an increasing level of difficulty. We draw 1000 datasets for each scenario for a varying level of noise σ ∈ {1, 2, 5, 10} and for a problem size of n = 100. Note that we use this relatively small problem size to allow the comparison with methods HL and FL2D that would not work for greater values of n. Figure 7 shows the results in terms of receiver operating characteristic (ROC) curves for recovering the change-points in rows, averaged over the 1000 runs. Similar results hold for the change-points in columns. This Figure exhibits the very good performance of our method, which outperforms its competitors by retrieving the change-points with a very small error rate even in high noise level frameworks. Moreover, our method seems to be less sensitive to the block pattern shape in matrix U than the other ones. In order to further assess our approach we give in Figure 8 the boxplots of the Area Under Curve (AUC) for the different ROC curves. We also give in Table 1 the mean of the AUC and the associated standard deviation.
In order to further compare the different approaches we generated matrices Y satisfying Model (2.1) with a "checkerboard" matrix μ ,(1) k, given in (4.1) for n ∈ {50, 100, 250}. We observe from Table 2 that the performance of our   method are on a par with those of FL2D for n = 50 and 100. However, for n = 250 the computational burden of FL2D is so large that the results are not available, see the blue crosses in Table 2. The AUC are also displayed with boxplots in Figure 9.

Model selection
In the previous experiments we did not need to explain how to choose the number of estimated change-points since we used ROC curves for comparing the methodologies. However, in real data applications, it is necessary to propose a methodology for estimating the number of change-points. This is what we explain in the following.
In practice, we take s = K 2 max where K max is an upper bound for K 1 and K 2 . For choosing the final change-points we shall adapt the well-known stability selection approach devised by [17]. More precisely, we randomly choose M times n/2 columns and n/2 rows of the matrix Y and for each subsample we select s = K 2 max active variables. Finally, after the M data resamplings, we keep the change-points which appear a number of times larger than a given threshold. By the definition of the change-points given in (2.5), a change-point t 1,k or t 2, may appear several times in a given set of resampled observations. Hence, the To evaluate the performances of this methodology, we generated observations according to the "checkerboard" model defined in (2.1) with (μ , (1) k, ) defined in (4.1), s = 225 and M = 100. The results are given in Figure 10 which displays the score associated to each change-point for a given matrix Y. We can see from this figure that there are some spurious change-points close to the true changepoint positions. In order to identify the most representative change-point in a given neighborhood, we keep the one with the largest score among a set of contiguous candidates. The result of such a post-processing is displayed in the first row of Figure 11. More precisely the boxplots associated to the estimation of K 1 (resp. the histograms of the estimated change-points in rows) are displayed for different values of σ and different thresholds expressed as a percentage of the largest score. We can see from these figures that when the threshold is in the interval [20,40] the number and the location of the change-points are very well estimated even in the high noise level case.
In order to further assess our methodology including the post-processing step and to be in a framework closer to our real data application, we generated observations following (2.1) with n = 1000 and K 1 = K 2 = 100 where we used for the matrix U the same shape as the one of the matrix (μ , (1) k, ) except  that K 1 = K 2 = 100. In this framework, the proportion of change-points is thus ten times larger than the one of the previous case. The corresponding results are displayed in Figures 12 and 13. We can see from the last figure that taking a threshold equal to 20% provides the best estimations of the number and of the change-point positions. This threshold corresponds to the lower bound of the thresholds interval obtained in the previous configuration. Our package blockseg provides an estimation of the matrix U for any threshold given by the user as we shall explain in the next section.

Application to HiC data
In this section, we apply our methodology to publicly available HiC data (http://chromosome.sdsc.edu/mouse/hi-c/download.html) already studied by [5]. This technology is based on a deep sequencing approach and provides read pairs corresponding to pairs of genomic loci that physically interacts in the nucleus, see [13] for more details. The raw measurements provided by HiC data is therefore a list of pairs of locations along the chromosome, at the nucleotide resolution. These measurement are often summarized as a square matrix where each entry at row i and column j stands for the total number of read pairs matching in position i and position j, respectively. Positions refer here to a sequence of non-overlapping windows of equal sizes covering the genome. The number of windows may vary from one study to another: [13] considered a Mb resolution, whereas [5] went deeper and used windows of 40kb (called hereafter the resolution).
In our study, we processed the interaction matrices of Chromosomes 1 and 19 of the mouse cortex at a resolution 40 kb and we compared the number and the location of the estimated change-points found by our approach with those obtained by [5] on the same data since no ground truth is available. More precisely, in the case of Chromosome 1, n = 4930 and in the case of Chromosome 19, n = 1534.
Let us first give the results obtained by using our methodology. Figure 14 displays the change-point locations obtained for the different values of the threshold used in our adaptation of the stability selection approach and defined in Section 5. The corresponding estimated matrices Y = U for Chromosome 1 and 19 are displayed in Figure 15 when the thresholds are equal to 10, 15 and 20%, which correspond to the red horizontal levels in Figure 14.
In order to compare our approach with the technique devised by [5], we display in Figure 16 the number of change-points in rows found by our methodology as a function of the threshold and a red line corresponding to the number of change-points found by [5]. Note that we did not display the change-points in columns in order to save space since they are similar.
We also compute the two parts of the Hausdorff distance for the change-points in rows which is defined by where t and t B are the change-points in rows found by our approach and [5], respectively. In (6.1),  [5].
More precisely, Figure 17 displays the boxplots of the d 1 and d 2 parts of the Hausdorff distance without taking the supremum in orange and blue, respectively. We can observe from Figure 17 that some differences indeed exist between the segmentations produced by the two approaches but that the boundaries of the blocks are quite close when the number of estimated change-points are the same, which is the case when thresh = 1.8% (left) and 10% (right).  In the case where the number of estimated change-points are on a par with those of [5], we can see from Figure 18 that the change-points found with our strategy present a lot of similarities with those found by the HMM based approach of [5]. However, contrary to our method, the approach of [5] can only deal with binned data at the resolution of several kilobases of nucleotides. The very low computational burden of our strategy paves the way for processing data collected at a very high resolution, namely at the nucleotide resolution, which is one of the main current challenges of molecular biology.

Conclusion
In this paper, we proposed a novel approach for retrieving the boundaries of a block wise constant matrix corrupted with noise by rephrasing this problem as a variable selection issue. Our approach is implemented in the R package blockseg which is available from the Comprehensive R Archive Network (CRAN).
In the course of this study, we have shown that our method has two main features which make it very attractive. Firstly, it is very efficient both from the theoretical and practical point of view. Secondly, its very low computational burden makes its use possible on very large data sets coming from molecular biology.
However, in view of applying our approach to HiC data experiments, it could be interesting to develop a methodology which could deal with change-points that do not affect all rows and columns simultaneously. This will be the subject of a future work.

A.1. Proofs of statistical results
Proof of Lemma 2. A necessary and sufficient condition for a vector B in R n 2 to minimize the function Φ defined by: is that the zero vector in R n 2 belongs to the subdifferential of Φ at B that is: Using that X By (A1) and the Markov inequality, we get that for all positive η, By taking η = x n /(2β), we get that Since the same result is valid for −E n,j , we get that which concludes the proof of Lemma 3.
Using similar arguments, we can prove that P(A n,k ∩ D (r) n ) → 0, which concludes the proof of Proposition 1.

A.2. Proofs of computational lemmas
Proof of Lemma 4. Consider X v for instance (the same reasoning applies for X v): we have X v = (T ⊗ T)v = Vec(TVT ) where V is the n × n matrix such that Vec(V) = v. Because of its triangular structure, T operates as a cumulative sum operator on the columns of V. Hence, the computations for the jth column is done by induction in n operations. The total cost for the n columns of TV is thus n 2 . Similarly, right multiplying a matrix by T boils down to perform cumulative sums over the rows. The final cost for X v = Vec(TVT ) is thus 2n 2 in case of a dense matrix V, and possibly less when V is sparse. and * denotes the Khatri-Rao product, which is defined as follows for two n × n matrices A and B where the a i (resp. b i ) are the columns of A (resp. B). Using (25) of Theorem 2 in [14], we get that where • denotes the Hadamard or entry-wise product. Observe that by definition of T, (T •,Q A T •,Q A ) k, = n − (q a k ∨ q a ) and (T •,R A T •,R A ) k, = n − (r a k ∨ r a ). By (A.6), X X A,A is a Gram matrix which is positive and definite since the vectors T •,qa 1 +1 ⊗ T •,ra 1 +1 , T •,qa 2 +1 ⊗ T •,ra 2 +1 , . . . , T •,qa K +1 ⊗ T •,ra K +1 are linearly independent.
Proof of Lemma 6. The operations of adding/removing a column to a Cholesky factorization are classical and well treated in books of numerical analysis, see e.g. [8]. An advantage of our settings is that there is no additional computational cost for computing X X j when entering a new variable j thanks to the closedform expression (3.1).