Factor Modelling for Clustering High-dimensional Time Series

We propose a new unsupervised learning method for clustering a large number of time series based on a latent factor structure. Each cluster is characterized by its own cluster-specific factors in addition to some common factors which impact all the time series concerned. Our setting also offers the flexibility that some time series may not belong to any cluster. The consistency, with explicit convergence rates, is established for the estimation of the common factors, the cluster-specific factors, and the latent clusters. Numerical illustration with both simulated data and a real data example is also reported. As a spin-off, the proposed new approach also advances significantly the statistical inference for the factor model of Lam and Yao (2012).


Introduction
One of the primary tasks of data mining is clustering. While most clustering methods were originally designed for independent observations, clustering a large number of time series has gained increasing momentum (Esling and Agon 2012), driven by the need to mine large and complex data recorded over time in business, finance, biology, medicine, climate, energy, environment, psychology, multimedia and other areas (Table 1 of Aghabozorgi et al. 2015). Consequently, the literature on time series clustering is large; see Liao (2005), Aghabozorgi et al. (2015), Maharaj et al. (2019) and the references therein. The basic idea is to develop some relevant similarity or distance measures among time series first, and then to apply standard clustering algorithms such as hierarchical clustering or the k-means method. Most existing similarity/distance measures for time series may be loosely divided into two categories: data-based and feature-based. The data-based approaches define the measures directly on the observed time series using, for example, the L_2 or, more generally, the Minkowski distance, or various correlation measures. Alonso and Peña (2019) proposed a generalized cross correlation as a similarity measure, which takes into account cross correlations over different time lags.
Dynamic time warping can be applied beforehand to cope with time deformation due to, for example, shifting holidays over different years (Keogh and Ratanamahatana 2005). The feature-based approaches extract relevant features from the observed time series first, and then define similarity/distance measures based on the extracted features. The feature extraction can be carried out by various transformations such as Fourier, wavelet or principal component analysis (Section 2.3 of Roelofsen 2018). The features from fitted time series models can also be used to define similarity/distance measures (Yao et al. 2000, Frühwirth-Schnatter and Kaufmann 2008). Attempts have also been made to define the similarity between two time series by measuring the discrepancy between the two underlying stochastic processes (Kakizawa et al. 1998, Khaleghi et al. 2016). Other approaches include Zhang (2013), which clusters time series based on the parallelism of their trend functions, and Ando and Bai (2017), which represents the latent clusters in terms of a factor model. So-called 'subsequence clustering' occurs frequently in the literature on time series clustering; see Keogh and Lin (2005), and Zolhavarieh et al. (2014). It refers to clustering the segments of a single long time series, which is not considered in this paper.
The goal of this study is to propose a new factor-model-based approach to cluster a large number of time series into latent clusters such that the members within each cluster share a similar dynamic structure, while the number of clusters and their sizes are all unknown. We represent the dynamic structures by latent common and cluster-specific factors, which are both unobserved and are identified by the difference in factor strength. The strength of a factor is measured by the number of time series which influence and/or are influenced by the factor (Chamberlain and Rothschild 1983). The common factors are strong factors, as each of them carries information on most (if not all) of the time series concerned. The cluster-specific factors are weak factors, as they only affect the time series in a specific cluster.
Though our factor model is similar to that of Ando and Bai (2017), our approach is radically different. First, we estimate the strong factors and all the weaker factors in one pass, and then recover the latent clusters based on the estimated weak factor loadings; Ando and Bai (2017) adopted an iterative least squares algorithm to estimate the factors/factor loadings and the latent cluster structure recursively. Secondly, our setting allows the flexibility that some time series do not belong to any cluster, which is often the case in practice. Thirdly, our setting allows dependence between the common factors and the cluster-specific factors, while Ando and Bai (2017) imposed an orthogonality condition between the two; see Remark 1(iv) in Section 2 below.
The methods for estimating the factors and factor loadings are adapted from Lam and Yao (2012). Nevertheless, substantial advances have been made even within the context of Lam and Yao (2012): (i) we remove the artificial condition that the factor loading spaces for strong and weak factors are perpendicular to each other, (ii) we allow weak serial correlations in the idiosyncratic components of the model, which were assumed to be vector white noise by Lam and Yao (2012), and, more significantly, (iii) we propose a new and consistent ratio-based estimator for the number of factors (see Step 1 and also Remark 3(iii) in Section 3 below).
The rest of the paper is organized as follows. Our factor model and the relevant conditions are presented in Section 2, where we elaborate explicitly why it is natural to identify the latent clusters in terms of factor strength. The new clustering algorithm is presented in Section 3; the clustering is based on the factor loadings on all the weak factors, applying a K-means algorithm with a correlation-type similarity measure defined in terms of those loadings. The asymptotic properties of the estimation of the factors and factor loadings are collected in Section 4. Section 5 establishes the consistency of the proposed factor-based clustering algorithm with error rates. Numerical illustration with both simulated data and a real data example is reported in Section 6. We provide some further comments in Section 7. All technical proofs are relegated to a supplementary document.
All vectors are column vectors. Let ‖a‖ denote the Euclidean norm of a vector a.
For any matrix G ≡ (g_{i,j}), let M(G) denote the linear space spanned by the columns of G, ‖G‖ the square root of the largest eigenvalue of G^⊤G, ‖G‖_min the square root of the smallest eigenvalue of G^⊤G, and |G| the matrix with |g_{i,j}| as its (i, j)-th element. We write a ≍ b if a = O(b) and b = O(a). We use C, C_0 to denote generic positive constants independent of p and n, which may be different at different places.

Models
Let {y_t}_{1≤t≤n} be a weakly stationary p × 1 vector time series, i.e. E(y_t) is a constant vector independent of t, and all the elements of Cov(y_{t+k}, y_t) are finite and depend on k only.
Suppose that y_t consists of d + 1 latent segments, i.e.

y_t = (y_{t,1}^⊤, ..., y_{t,d}^⊤, y_{t,d+1}^⊤)^⊤,    (1)

where y_{t,j} is p_j × 1, p_{d+1} ≥ 0, and p_0 ≡ p_1 + ... + p_d = p − p_{d+1}. Furthermore, we assume the following latent factor model with d clusters:

y_t = A x_t + (B^⊤, 0)^⊤ z_t + ε_t,  with B = diag(B_1, ..., B_d) and z_t = (z_{t,1}^⊤, ..., z_{t,d}^⊤)^⊤,    (2)

where A is a p × r_0 matrix with rank r_0, x_t is an r_0 × 1 time series representing the r_0 common factors with |Var(x_t)| ≠ 0, B_j is a p_j × r_j matrix with rank r_j, z_{t,j} is an r_j × 1 time series representing the r_j factors for y_{t,j} only with |Var(z_{t,j})| ≠ 0, 0 stands for the p_{d+1} × r matrix with all elements equal to 0 (so that (B^⊤, 0)^⊤ is p × r), r = r_1 + ... + r_d, and ε_t is an idiosyncratic component in the sense of Chamberlain (1983) and Chamberlain and Rothschild (1983) (see below). Note that in the model above we only observe a permuted y_t (i.e. the order of the components of y_t is unknown), while all the terms on the RHS of (2) are unknown.
By (2), the p_0 components of y_t are grouped into the d clusters y_{t,1}, ..., y_{t,d}, while the p_{d+1} components of y_{t,d+1} do not belong to any clusters. The j-th cluster y_{t,j} is characterized by the cluster-specific factors z_{t,j}, in addition to its dependence on the common factors x_t.
The goal is to identify those d latent clusters from the observations y_1, ..., y_n. Note that all the p_j, r_j and d are also unknown.
We always assume that the number of common factors and the number of cluster-specific factors for each cluster remain bounded as the number of time series p diverges.
This reflects the fact that factor models are only appealing when the number of factors is much smaller than the number of time series concerned. Furthermore, we assume that the number of time series p_j in each cluster diverges at a lower order than p, so that the number of clusters d diverges as well. See Assumption 1 below.
Assumption 1. For j = 1, ..., d, C^{-1} p^{1−δ} ≤ p_j ≤ C p^{1−δ}, where C > 0 and δ ∈ (0, 1) are constants independent of n and p.
The strength of a factor is measured by the number of time series which influence and/or are influenced by the factor. Each component of x_t is a common factor: it is related to most, if not all, components of y_t in the sense that most elements of the corresponding column of A (i.e. the factor loadings) are non-zero. Hence it is reasonable to assume

‖a_j‖² ≍ p for j = 1, ..., r_0,    (3)

where a_j is the j-th column of A. This is in the same spirit as the definition of the common factors by Chamberlain and Rothschild (1983). Denote by b_i^{(j)} the i-th column of the p_j × r_j matrix B_j. In the same vein, we assume that

‖b_i^{(j)}‖² ≍ p^{1−δ} for i = 1, ..., r_j and j = 1, ..., d,    (4)

as each cluster-specific factor for the j-th cluster is related to most of the p_j ≍ p^{1−δ} (Assumption 1) time series in the cluster. Note that the factor strength can be measured by the constant δ ∈ [0, 1]: δ > 0 in (4) indicates that the factors z_t = (z_{t,1}^⊤, ..., z_{t,d}^⊤)^⊤ are weaker than the factors x_t, which correspond to δ = 0; see (3).
Conditions (3) and (4) are imposed under the assumption that all the factors remain unchanged as p diverges, all the entries of the covariance matrices Σ_x(k) = Cov(x_{t+k}, x_t) and Σ_z(k) = Cov(z_{t+k}, z_t) are bounded, and, furthermore, Σ_x(k) and Σ_z(k) are of full rank for k = 0, 1, ..., k_0, where k_0 ≥ 1 is an integer. Then conditions (3) and (4) are equivalent to (6) and (7) in Assumption 3 below, after the orthogonal normalization to be introduced now in order to make model (2) partially identifiable and operationally tractable.
In model (2), A, B, x_t and z_t are not uniquely defined as, for example, (A, x_t) can be replaced by (AH, H^{-1}x_t) for any r_0 × r_0 invertible matrix H. This lack of uniqueness gives us the flexibility to choose appropriate A and B to facilitate the estimation. Assumption 2 below specifies both A and the B_j to be half-orthogonal in the sense that the columns of A and of each B_j are orthonormal, which can be fulfilled by, for example, replacing the original (A, x_t) by (H, Vx_t), where A = HV is a QR decomposition of A. Even under Assumption 2, A and B are still not unique; in fact only the factor loading spaces M(A) and M(B_i) are uniquely defined by (2). Hence AA^⊤, i.e. the projection matrix onto M(A), is also unique.
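To illustrate this normalization concretely, the following minimal sketch (with hypothetical dimensions and randomly drawn loadings of our own choosing) verifies that replacing (A, x_t) by (H, Vx_t), where A = HV is a QR decomposition, leaves the common component Ax_t unchanged while making the loading matrix half-orthogonal:

```python
import numpy as np

rng = np.random.default_rng(0)
p, r0, n = 100, 2, 200                 # hypothetical dimensions

A_raw = rng.normal(size=(p, r0))       # an arbitrary (non-orthonormal) loading matrix
x_raw = rng.normal(size=(r0, n))       # the corresponding factor series

# QR decomposition A_raw = H V with H half-orthogonal (H^T H = I_{r0})
H, V = np.linalg.qr(A_raw)

# (A_raw, x_raw) and (H, V x_raw) generate identical common components,
# so only the loading space M(A) is identifiable:
assert np.allclose(A_raw @ x_raw, H @ (V @ x_raw))
assert np.allclose(H.T @ H, np.eye(r0))
```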
Assumption 2. A^⊤A = I_{r_0}, B_j^⊤B_j = I_{r_j} for 1 ≤ j ≤ d, and it holds for a constant q_0 ∈ (0, 1) that

‖A^⊤ (B^⊤, 0)^⊤‖ ≤ q_0.    (5)

Furthermore, for j = 1, ..., d, rp(B_j){rp(B_j)}^⊤ cannot be written as a block diagonal matrix with at least two blocks, where rp(B_j) denotes any row-permutation of B_j.
Condition (5) implies that the columns of (B^⊤, 0)^⊤ do not fall entirely into the space M(A), as otherwise one could not distinguish z_t from x_t. It is automatically fulfilled if A^⊤(B^⊤, 0)^⊤ = 0, which is the condition imposed in Lam and Yao (2012). Finally, the last condition in Assumption 2 ensures that the number of clusters d is uniquely defined.
Assumption 3. y_t, x_t and z_t are strictly stationary with finite fourth moments. As n, p → ∞,

‖Σ_x(k)‖ ≍ p ≍ ‖Σ_x(k)‖_min for k = 0, 1, ..., k_0,    (6)
‖Σ_z(k)‖ ≍ p^{1−δ} ≍ ‖Σ_z(k)‖_min for k = 0, 1, ..., k_0.    (7)

Furthermore, y_t is ψ-mixing with the mixing coefficients satisfying Σ_{t≥1} t ψ(t)^{1/2} < ∞, and Cov(x_t, ε_s) = 0 and Cov(z_t, ε_s) = 0 for any t and s.
Remark 1. (i) The factor strength is defined in terms of the orders of the factor loadings in (3) and (4). Due to the orthogonalization specified in Assumption 2, these are transformed into the orders of the covariance matrices in (6) and (7); see also Remark 1 in Lam and Yao (2012). Nevertheless, the factor strength is still measured by the constant δ ∈ [0, 1]: the smaller δ is, the stronger the factor is. The common factors in x_t are the strongest with δ = 0, and the cluster-specific factors in z_t are weaker with δ ∈ (0, 1). In (2), ε_t represents the idiosyncratic component of y_t in the sense that each component of ε_t only affects the corresponding component and a few other components of y_t (i.e. δ = 1), which is implied by Assumption 4 below. Hence the strength of ε_t is the weakest. The differences in factor strength make x_t, z_t and ε_t on the RHS of (2) (asymptotically) identifiable. To simplify the presentation, we assume that all the components of z_t are of the same strength (i.e. all the p_i are of the same order). See the real data example in Section 6.2 for how to handle cluster-specific factors of different strengths.
(ii) Model (2) is similar to that of Ando and Bai (2017). However, we do not require the common factors x_t and the cluster-specific factors z_t to be orthogonal to each other in the sense that n^{-1} Σ_{1≤t≤n} x_t z_t^⊤ = 0, which is imposed by Ando and Bai (2017). Furthermore, we allow the idiosyncratic term ε_t to exhibit weak autocorrelations (Assumption 4 below), instead of assuming complete independence as in Ando and Bai (2017).
We now impose some structural assumptions on the idiosyncratic term ε_t in model (2).
Assumption 4. Let ε_t = Ge_t, where G is a p × p constant matrix with ‖G‖ bounded from above by a positive constant independent of p. Furthermore, one of the following two conditions holds.

(i) e_t = Σ_{k≥0} c_k η_{t−k} is a linear process with Σ_{k≥0} |c_k| < ∞, where η_t = (η_{t,1}, ..., η_{t,p})^⊤ and the η_{t,i} are i.i.d. across t and i with mean 0, variance 1 and E(η_{t,i}^4) < ∞.

(ii) e_t = (e_{t,1}, ..., e_{t,p})^⊤ consists of p independent weakly stationary univariate time series with E(e_t) = 0, and each e_{t,i} satisfies the sub-Gaussian tail condition (10).

Remark 2. In Assumption 4, condition (i) assumes that e_t is a linear process with the same serial correlation structure across all the components. Condition (ii) allows some non-linear serial dependence, and the dependence structures for different components may differ; but then the sub-Gaussian condition (10) is required.
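For concreteness, a minimal sketch of one idiosyncratic process covered by Assumption 4(i) is given below; the MA(1) form, the coefficient value and the particular choice of G are illustrative assumptions rather than prescriptions of the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
p, n = 50, 300                                  # hypothetical dimensions

# Assumption 4(i): e_t is a linear process (here MA(1)) with the same
# serial-correlation structure across all p components.
theta = 0.5                                     # illustrative MA coefficient
eta = rng.normal(size=(n + 1, p))               # i.i.d. N(0,1) innovations
e = eta[1:] + theta * eta[:-1]                  # e_t = eta_t + theta * eta_{t-1}

# G is p x p with spectral norm bounded as p grows; a scaled random matrix
# is one simple example satisfying this.
G = rng.normal(size=(p, p)) / np.sqrt(p)
eps = e @ G.T                                   # row t holds eps_t = G e_t
```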

A clustering algorithm
With the observations y_1, ..., y_n available, we propose below an algorithm (in five steps) to identify the d latent clusters. To this end, we introduce some notation first. Let

ȳ = n^{-1} Σ_{t=1}^{n} y_t,  Σ̂_y(k) = (n − k)^{-1} Σ_{t=1}^{n−k} (y_{t+k} − ȳ)(y_t − ȳ)^⊤,  M̂ = Σ_{k=0}^{k_0} Σ̂_y(k) Σ̂_y(k)^⊤,    (11)

where k_0 is the pre-specified integer in Assumption 3, and let λ̂_1 ≥ ... ≥ λ̂_p ≥ 0 denote the eigenvalues of M̂.

Step 1 (Estimation of the numbers of factors.) For a prescribed integer J_0 < p and 1 ≤ j < J_0, define the cumulative eigenvalue ratios

R̂_j = (Σ_{i=j}^{J_0} λ̂_i) / (Σ_{i=j+1}^{J_0} λ̂_i).    (12)

We say that R̂_s attains a local maximum if R̂_s > max(R̂_{s−1}, R̂_{s+1}). The estimators for the numbers of factors are then defined as

r̂_0 = the location of the largest local maximum of R̂_1, ..., R̂_{J_0−1}, and r̂_0 + r̂ = the location of the second largest local maximum.    (13)

Step 2 (Estimation of the loadings of the common factors.) Let γ̂_1, ..., γ̂_p be the orthonormal eigenvectors of the matrix M̂, arranged according to the descending order of the corresponding eigenvalues. The estimated loading matrix for the common factors is Â = (γ̂_1, ..., γ̂_{r̂_0}).

Step 3 (Estimation of the loadings of the cluster-specific factors.) Replace y_t by (I_p − ÂÂ^⊤)y_t in (11), and repeat the eigenanalysis as in Step 2 above, but now denote the corresponding orthonormal eigenvectors by ζ̂_1, ..., ζ̂_p. The estimated loading matrix for the cluster-specific factors is B̂ = (ζ̂_1, ..., ζ̂_{r̂}).

Step 4 (Identification of the components not belonging to any clusters.) Let b̂_1, ..., b̂_p denote the row vectors of B̂. Then the identified index set for the components of y_t not belonging to any clusters is

Ĵ_{d+1} = {1 ≤ i ≤ p : ‖b̂_i‖ < ω_p},    (16)

where ω_p > 0 is a constant threshold satisfying appropriate rate conditions; see Theorem 5 in Section 5 below, and Section 6.1 for three practical choices.

Step 5 (K-means clustering.) Denote by d̄ the number of eigenvalues of |B̂B̂^⊤| greater than 1 − log^{-1} n, which is taken as an upper bound of the number of clusters. Let p̂_0 = p − |Ĵ_{d+1}|, and let F̂ be the p̂_0 × r̂ matrix obtained from B̂ by removing the rows with indices in Ĵ_{d+1}. Let f̂_1, ..., f̂_{p̂_0} denote the p̂_0 rows of F̂, and let R̂ be the p̂_0 × p̂_0 matrix whose (ℓ, m)-th element is |f̂_ℓ^⊤ f̂_m| / (‖f̂_ℓ‖ ‖f̂_m‖). Perform the K-means clustering (with the L_2-distance) on the p̂_0 rows of R̂ to form d̂ clusters, where d̂ ≤ d̄ is chosen such that the within-cluster sums of the L_2-distances (to the cluster center points) stabilize.

Remark 3. (i) The ratio-based estimation in Step 1 is new. By Theorem 3 in Section 4 below, it holds that r̂_0 → r_0 and r̂ → r in probability. The existing approaches use the ratios of the ordered eigenvalues of the matrix M̂ instead (Lam and Yao 2012, Chang et al. 2015, Li et al. 2017), leading to an estimator which may not be consistent; see Example 1 below. Note that Lam and Yao (2012) only established a partial relation for their estimator of r_0, rather than full consistency.

(ii) The intuition behind the estimators defined via the ratios in (12) is that the eigenvalues λ_1 ≥ ... ≥ λ_p of M = Σ_{k=0}^{k_0} Σ_y(k)Σ_y(k)^⊤, where Σ_y(k) = Cov(y_{t+k}, y_t), satisfy λ_j ≍ p² for 1 ≤ j ≤ r_0, λ_j ≍ p^{2(1−δ)} for r_0 < j ≤ r_0 + r, and λ_j = o(p^{2(1−δ)}) for j > r_0 + r. This is implied by the differences in strength among the common factors x_t, the cluster-specific factors z_{t,i}, and the idiosyncratic components ε_t; see Theorem 3. Note that we use the ratios of the cumulative eigenvalues in (12) in order to add together the information from the different lags k. In practice, we set k_0 to be a small integer such as k_0 ≤ 5, as significant autocorrelation typically occurs at small lags; the results do not vary much with respect to the value of k_0 (see the simulation results in Section 6.1 below). We truncate the sequence {R̂_j} at J_0 to alleviate the impact of '0/0'. In practice, we may set J_0 = p/4 or p/3.

(iii) Step 3 removes the common factors first before estimating B̂, as Lam and Yao (2012) showed that weak factors can be estimated more accurately after removing the strong factors from the data.

(iv) Once the numbers of strong and weak factors are correctly specified, the factor loading spaces are relatively easy to identify. In fact, M(Â) is a consistent estimator for M(A). However, M(B̂) is a consistent estimator for M{(I_p − AA^⊤)(B^⊤, 0)^⊤} instead of M{(B^⊤, 0)^⊤}; see Theorem 2 in Section 4 below. Furthermore, the last p_{d+1} rows of (I_p − AA^⊤)(B^⊤, 0)^⊤ are no longer 0. Nevertheless, when the elements of AA^⊤ and BB^⊤ have different orders, those p_{d+1} zero rows can be recovered from B̂ in Step 4; see Theorem 5 in Section 5 below.

(v) Given the block diagonal structure of B in (2), the d clusters would be identified easily by taking the (i, j)-th element of |BB^⊤| as the similarity measure between the i-th and the j-th components, or by simply applying the K-means method to the rows of |BB^⊤|. However, applying the K-means method directly to the rows of B̂ will not do. Theorems 2 and 4 indicate that the block diagonal structure, though masked by asymptotically diminishing 'noise', is still present in B̂B̂^⊤ via a latent row-permutation of B̂. Accordingly, the cluster analysis in Step 5 is based on the absolute values of the correlation-type measures among the rows of F̂, with F̂F̂^⊤ serving as an estimator for BB^⊤.

(vi) In Step 5, we search for the number of clusters d̂ by the 'elbow method', which is the most frequently used method in K-means clustering. Nevertheless, d̄ provides an upper bound for d; see Theorem 6 below. Our empirical experience indicates that d̄ = d holds often, especially when r_j, 1 ≤ j ≤ d, are small; see Tables 7 and 8 in Section 6.1 below. Note that BB^⊤ is a block diagonal matrix with d blocks and with all its non-zero eigenvalues equal to 1. Therefore the dominant eigenvalue for each of the d latent blocks in B̂B̂^⊤ is greater than, or at least very close to, 1. Moreover, by the Perron-Frobenius theorem, the largest eigenvalue of |B_j B_j^⊤|, i.e. the so-called Perron-Frobenius eigenvalue, is strictly greater than the other eigenvalues of |B_j B_j^⊤| under the last condition in Assumption 2. This is the intuition behind the definition of d̄.

Example 1. Consider a simple model of the form (2) in which the loadings a_1, a_2, a_3 are constants, and the innovations u_{i,t}, for different i and t, are independent N(0, 1). Let M = Σ_{0≤k≤1} Σ_y(k)Σ_y(k)^⊤, and let λ_1 ≥ λ_2 ≥ λ_3 be the three largest eigenvalues of M. It can be shown that the ratios of the consecutive eigenvalues among λ_1, λ_2, λ_3 do not separate the factors clearly, so that r_0 (= 1) or r (= 2) cannot be estimated stably based on the ratios of the eigenvalues of M in this example. In fact, let the two p × 3 matrices U_0 and U_1 consist of the leading eigenvectors of Σ_y(0)Σ_y(0)^⊤ and Σ_y(1)Σ_y(1)^⊤, respectively.
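To fix ideas, here is a minimal sketch of Steps 1-2 based on the form of (11)-(13) reconstructed above; the function names, the default J_0 and the handling of the boundary point R̂_1 (which behaves as a local maximum in the real data example of Section 6.2) are our own implementation choices.

```python
import numpy as np

def acov(y, k):
    """Sample autocovariance Sigma_hat_y(k) for an n x p data matrix y."""
    yc = y - y.mean(axis=0)
    n = yc.shape[0]
    return yc[k:].T @ yc[:n - k] / (n - k)

def estimate_factor_numbers(y, k0=5, J0=None):
    """Steps 1-2: eigenanalysis of M_hat and cumulative eigenvalue ratios."""
    n, p = y.shape
    J0 = J0 if J0 is not None else max(p // 4, 3)
    # (11): M_hat = sum_{k=0}^{k0} Sigma_hat(k) Sigma_hat(k)^T
    M = sum(acov(y, k) @ acov(y, k).T for k in range(k0 + 1))
    lam, gamma = np.linalg.eigh(M)
    lam, gamma = lam[::-1], gamma[:, ::-1]        # descending eigenvalues
    # (12): R_j = sum_{i=j}^{J0} lam_i / sum_{i=j+1}^{J0} lam_i, 1 <= j < J0
    tails = np.cumsum(lam[:J0][::-1])[::-1]       # tails[j-1] = lam_j + ... + lam_J0
    R = tails[:-1] / tails[1:]
    # (13): locations of the two largest local maxima of R_1, ..., R_{J0-1}
    cand = [j for j in range(1, len(R) - 1) if R[j] > max(R[j - 1], R[j + 1])]
    if len(R) > 1 and R[0] > R[1]:
        cand = [0] + cand                         # treat R_1 as a boundary maximum
    top2 = sorted(sorted(cand, key=lambda j: R[j], reverse=True)[:2])
    r0_hat, r_total_hat = top2[0] + 1, top2[-1] + 1   # 1-indexed positions
    A_hat = gamma[:, :r0_hat]                     # Step 2: common-factor loadings
    return r0_hat, r_total_hat, A_hat, R
```

In the notation of Section 3, r_total_hat estimates r_0 + r, so r̂ = r_total_hat − r̂_0.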
4 Asymptotic properties of the estimation for factors

Theorem 1 and Remark 4 below show that, in the absence of the weak factors z_t, the estimation of the strong factor loading space M(A) achieves the root-n convergence rate in spite of the diverging p. Since only the factor loading space M(A) is uniquely defined by (2) (see the discussion below Assumption 2), we measure the estimation error in terms of its (unique) projection matrix AA^⊤.

Theorem 1. Let Assumptions 1-4 hold, p, n → ∞, n = O(p) and p^δ = o(n). Then ÂÂ^⊤ − AA^⊤ converges to 0 (in spectral norm) in probability, at the explicit rate given in (17).

Assumption 2 ensures that the rank of the matrix B* ≡ (I_p − AA^⊤)(B^⊤, 0)^⊤ is r. Denote by P_{A⊥B} the projection matrix onto M(B*). Then B̂B̂^⊤ is a consistent estimator of P_{A⊥B}; see Theorem 2 below, and also Remark 3(iv). Theorem 3 below specifies the asymptotic behavior of the ratios of the cumulative eigenvalues used for estimating the numbers of factors in Step 1 in Section 3 above, in terms of explicit bounds which hold for some constant C > 0 and any fixed positive integer s. It implies that r̂_0 → r_0 and r̂ → r in probability, provided that J_0 > r_0 + r is fixed.
Remark 4. It is worth pointing out that the block diagonal structure of B is not required for Theorems 1-3. On the other hand, if A^⊤(B^⊤, 0)^⊤ = 0 and {x_t} and {z_t} are independent, the term p^{−δ/2} on the RHS of (17)-(19) disappears.

5 Asymptotic properties of the clustering
Assumption 5. The elements of AA^⊤ are of the order O(p^{-1}), and ‖b_i‖² ≍ p^{δ−1} for 1 ≤ i ≤ p_0, where b_i denotes the i-th row of the matrix B.
The orthogonalization A^⊤A = I_{r_0} implies that the average of the squared elements of A is O(p^{-1}). Since r_0 is finite, it is reasonable to assume that the elements of AA^⊤ are O(p^{-1}). As B is a block diagonal matrix with blocks {B_j} and B_j^⊤B_j = I_{r_j}, the squared elements of B_j are of the order p_j^{-1} ≍ p^{−(1−δ)} on average. As r_j is bounded, it is reasonable to assume ‖b_i‖² ≍ p^{δ−1}. Assumption 5 ensures that (I_p − AA^⊤)(B^⊤, 0)^⊤ is asymptotically a block diagonal matrix; see Theorem 4 below. This enables us to recover the block diagonal structure of B from B̂, which provides a consistent estimator for the space M{(I_p − AA^⊤)(B^⊤, 0)^⊤} (Theorem 2 above), and also to separate out the components of y_t not belonging to any clusters.
Theorem 4. Let Assumptions 1, 2 and 5 hold. Divide the matrix (B^⊤, 0)^⊤(B^⊤, 0) − P_{A⊥B} into (d + 1) × (d + 1) blocks, and denote its (i, j)-th block of size p_i × p_j by V_{i,j}. Then ‖V_{i,j}‖ → 0 for all 1 ≤ i, j ≤ d + 1 as p → ∞.

Theorem 5. Let the conditions of Theorem 2 and Assumption 5 hold. For Ĵ_{d+1} defined in (16) with the threshold ω_p given there, the proportion of the components from the d clusters which are wrongly placed in Ĵ_{d+1} converges to 0 in probability (see (24)); furthermore, the proportion of the components not belonging to any clusters which are correctly placed in Ĵ_{d+1} converges to 1 in probability (see (25)).

Theorem 6. Let the conditions of Theorem 5 hold, and p^δ r log² n = o(n). Then P(d̄ ≥ d) → 1 as n, p → ∞.
Remark 5. Theorem 5 shows that most of the components belonging to the d clusters will not be classified as belonging to no cluster (see (24)); furthermore, most of the components not belonging to any clusters will be correctly identified (see (25)). Theorem 6 shows that the probability of underestimating d converges to 0.
To investigate the errors of the K-means clustering, let R = (|r_{ℓ,m}|) be the p_0 × p_0 matrix with r_{ℓ,m} = b_ℓ^⊤ b_m / (‖b_ℓ‖ ‖b_m‖), the population counterpart of the correlation-type measure used in Step 5. We assume that d is known. Let O_d be the set consisting of all p_0 × p_0 matrices with d distinct rows, and put D_0 = arg min_{D ∈ O_d} ‖R − D‖_F. For any p_0-vector g with its elements taking integer values between 1 and d, let O_d(g) = {D ∈ O_d : two rows of D are the same if and only if the corresponding two elements of g are the same}. Such a g partitions {1, ..., p_0} into d subsets; let τ(g) denote the number of misclassified components under the partition g.

Note that the d distinct rows of D_0 would be the centers of the d clusters identified by the K-means method based on the rows of R. However, R is unknown, and we identify the d clusters based on its estimator R̂; see Step 5 of the algorithm in Section 3. The clustering based on R̂ can only be successful if that based on R is successful, i.e. D_0 ∈ O_d(g_0), where g_0 is the p_0-vector with 1 as its first p_1 elements, 2 as its next p_2 elements, ..., and d as its last p_d elements. Given the block diagonal structure of B, the condition D_0 ∈ O_d(g_0) is likely to hold.

Assumption 6. D_0 ∈ O_d(g_0), and for some constant c > 0, min_{D ∈ O_d(g)} ‖R − D‖²_F ≥ ‖R − D_0‖²_F + c τ(g) for any such partition vector g.

Theorem 7. Let the conditions of Theorem 5 and Assumption 6 hold, and let the number of clusters d be known. Denote by τ the number of the components of y_t misclassified by the K-means clustering in Step 5. Then τ/p → 0 in probability as n, p → ∞, at the explicit rate given in (27).

Remark 6. Theorem 7 implies that the misclassification rate of the K-means method converges to 0, though the convergence rate in (27) is slow. However, a faster rate is attained when A^⊤(B^⊤, 0)^⊤ = 0 and {x_t} and {z_t} are independent, as then τ/p = O_p(n^{−1/2} r^{1/2}); see also Remark 4. Assumption 6 requires that ‖R − D‖²_F increases as the number τ(g) of misplaced members of the partition becomes large, which is necessary for the K-means method.
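To make Step 5 and the quantities above concrete, the sketch below carries out the thresholding of Step 4 and the K-means step on the rows of R̂, taking d as known (as in Theorem 7). The explicit similarity |f̂_ℓ^⊤f̂_m|/(‖f̂_ℓ‖‖f̂_m‖) is our reading of the correlation-type measure of Step 5, and scikit-learn's KMeans supplies the L_2 K-means routine.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_step5(B_hat, omega_p, d):
    """Steps 4-5 sketch: threshold the rows of B_hat, build the similarity
    matrix R_hat, and cluster its rows by K-means (d assumed known)."""
    row_norms = np.linalg.norm(B_hat, axis=1)
    J_out = np.where(row_norms < omega_p)[0]      # (16): components in no cluster
    keep = np.where(row_norms >= omega_p)[0]
    F = B_hat[keep]                               # the p0_hat x r_hat matrix F_hat
    # R_hat: absolute correlation-type measure between the rows of F_hat
    nrm = np.linalg.norm(F, axis=1)
    R = np.abs(F @ F.T) / np.outer(nrm, nrm)
    labels = KMeans(n_clusters=d, n_init=10).fit_predict(R)
    return J_out, keep, labels
```

In practice d is unknown; per Step 5 one would run K-means for d̂ = 1, ..., d̄ and pick d̂ at the elbow of the within-cluster sums of distances.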
6 Numerical properties

Simulation
We illustrate the proposed methodology through a simulation study with model (2). We draw the elements of A and B_j independently from U(−1, 1). All component series of x_t are independent AR(1), and all component series of z_t are independent MA(1), with Gaussian innovations.
All components of ε_t are independent MA(1) with N(0, 0.25) innovations. All the AR and MA coefficients are drawn randomly from U{(−0.95, −0.4) ∪ (0.4, 0.95)}. The standard deviations of the components of x_t and z_t are drawn randomly from U(1, 2).
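A minimal sketch of this data-generating process is given below. The sizes n, p, d, p_1, r_0 and r_j are placeholder values rather than the scenario settings of the paper, and the factor series are rescaled empirically to hit the target standard deviations, which is a simplification of the design described above.

```python
import numpy as np

rng = np.random.default_rng(2024)

def simulate_model2(n=400, p=200, d=3, p1=50, r0=2, rj=2):
    """Sketch of the simulation design for model (2): AR(1) common factors,
    MA(1) cluster-specific factors, MA(1) idiosyncratic noise."""
    def coef(size):  # coefficients from U{(-0.95, -0.4) U (0.4, 0.95)}
        return rng.uniform(0.4, 0.95, size) * rng.choice([-1, 1], size)

    phi = coef(r0)                                # AR(1) common factors
    x = np.zeros((n, r0))
    for t in range(1, n):
        x[t] = phi * x[t - 1] + rng.normal(size=r0)
    x *= rng.uniform(1, 2, r0) / x.std(axis=0)    # component sd drawn from U(1,2)

    theta = coef(d * rj)                          # MA(1) cluster-specific factors
    u = rng.normal(size=(n + 1, d * rj))
    z = u[1:] + theta * u[:-1]
    z *= rng.uniform(1, 2, d * rj) / z.std(axis=0)

    A = rng.uniform(-1, 1, (p, r0))               # elements of A, B_j from U(-1,1)
    B = np.zeros((p, d * rj))                     # block diagonal over d clusters;
    for j in range(d):                            # last p - d*p1 rows stay 0
        B[j * p1:(j + 1) * p1, j * rj:(j + 1) * rj] = rng.uniform(-1, 1, (p1, rj))

    te = coef(p)                                  # MA(1) idiosyncratic, N(0, 0.25)
    w = rng.normal(0.0, 0.5, size=(n + 1, p))
    eps = w[1:] + te * w[:-1]

    return x @ A.T + z @ B.T + eps                # n x p observations of model (2)

y = simulate_model2()
```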
The numbers of factors r_0 and r are estimated by the R̂_j-based method (13), using the ratios in (12) with k_0 = 1, ..., 5 and J_0 = [p/4]. For comparison, we also report the estimates based on the ratios of the eigenvalues of M̂ in (11), also with k_0 = 0, ..., 5; this is the standard method used in the literature and is defined as in (13) but with R̂_j = λ̂_j / λ̂_{j+1} instead, where λ̂_1 ≥ ... ≥ λ̂_p ≥ 0 are the eigenvalues of M̂. See, e.g., Lam and Yao (2012).
For each setting, we replicate the experiment 1000 times.
Table 1: The relative frequencies of r̂_0 = r_0 and r̂_0 + r̂ = r_0 + r in a simulation for Scenario I with 1000 replications, where r_0 and r are estimated by the R̂_j-based method (13) and by the ratios of the eigenvalues of M̂.

The relative frequencies of r̂_0 = r_0 and r̂_0 + r̂ = r_0 + r are reported in Tables 1-2.
Overall, the method based on the ratios of the cumulative eigenvalues R̂_j provides accurate and robust performance and is not sensitive to the choice of k_0. The estimation based on the eigenvalue ratios of M̂ with k_0 ≥ 1 is competitive for r_0, but is considerably poorer for r_0 + r in Scenario II. Using M̂ with k_0 = 0 leads to weaker estimates of r_0 in Scenario I.
It is noticeable that the estimation of the number of common factors r_0 in Scenario II performs better than in Scenario I. This is due to the fact that the difference in factor strength between the common factors x_t and the cluster-specific factors z_t in Scenario II is larger than in Scenario I. In contrast, the results hardly change with the different values of p_1.

Table 2: The relative frequencies of r̂_0 = r_0 and r̂_0 + r̂ = r_0 + r in a simulation for Scenario II with 1000 replications, where r_0 and r are estimated by the R̂_j-based method (13) and by the ratios of the eigenvalues of M̂.

Recall that P_{A⊥B} is the projection matrix onto the space M{(I_p − AA^⊤)(B^⊤, 0)^⊤}; see Remark 3(iv). Tables 3-4 contain the means and standard deviations of the estimation errors ‖ÂÂ^⊤ − AA^⊤‖_F and ‖B̂B̂^⊤ − P_{A⊥B}‖_F, where Â is estimated by the eigenvectors of the matrix M̂ in (11) with k_0 = 1, ..., 5, following Step 2 of the algorithm stated in Section 3; see also Step 3 there for the similar procedure for estimating B̂. For comparison, we also include the estimates obtained with M̂ replaced by Σ̂_y(k)Σ̂_y(k)^⊤ for k = 0, 1, ..., 5. Both r_0 and r are assumed to be known. Tables 3-4 show clearly that the estimation based on M̂ is accurate and robust with respect to the different values of k_0.
Furthermore, using a single-lagged covariance matrix for estimating the factor loading spaces is not recommended. The error ‖ÂÂ^⊤ − AA^⊤‖_F in Scenario I is larger than that in Scenario II; this is due to the larger sample size n in Scenario II (see Theorem 1). In contrast, Theorem 2 shows that the error rate of ‖B̂B̂^⊤ − P_{A⊥B}‖ contains the term p^{δ/2} n^{−1/2}.
While n is larger in Scenario II, so is p^δ. This explains why the error ‖B̂B̂^⊤ − P_{A⊥B}‖_F in Scenario II is larger than that in Scenario I.
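The two Frobenius-norm errors reported in Tables 3-4 are straightforward to compute from the estimated and true loading matrices; a small helper (with our own naming, and B0 standing for (B^⊤, 0)^⊤) is sketched below.

```python
import numpy as np

def proj(M):
    """Projection matrix onto the column space of M (full column rank assumed)."""
    Q, _ = np.linalg.qr(M)
    return Q @ Q.T

def loading_errors(A_hat, A, B_hat, B0):
    """||A_hat A_hat^T - A A^T||_F and ||B_hat B_hat^T - P_{A-perp,B}||_F,
    where P_{A-perp,B} projects onto the column space of (I - A A^T) B0."""
    p = A.shape[0]
    PA = proj(A)
    err_A = np.linalg.norm(A_hat @ A_hat.T - PA, "fro")
    err_B = np.linalg.norm(B_hat @ B_hat.T - proj((np.eye(p) - PA) @ B0), "fro")
    return err_A, err_B
```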
In the sequel, we only report the results with r_0 and r estimated by (13) and the factor loading spaces estimated by the eigenvectors of M̂; we always set k_0 = 5. We examine now the effectiveness of Step 4 of the algorithm. Recall that J^c_{d+1} contains all the indices of the components of y_t belonging to one of the d clusters. The means and standard deviations of the two types of misclassification errors E_1 = |J^c_{d+1} ∩ Ĵ_{d+1}|/|J^c_{d+1}| and E_2 = |J_{d+1} ∩ Ĵ^c_{d+1}|/|J_{d+1}| over the 1000 replications are reported in Tables 5-6.

Table 5: The means and standard deviations (in parentheses) of the error rates E_1 and E_2 in a simulation for Scenario I with 1000 replications, with the 3 possible choices of the threshold ω_p in (16), and the numbers of factors r_0 and r either known or estimated.

Turning to Step 5, we estimate d̄ as an upper bound for d. As r_j = 2 for j = 1, ..., d, d̂ = d occurs almost surely in our simulation; see Tables 7-8. The d̂ clusters are then obtained by performing the K-means clustering on the p̂_0 rows of R̂, where p̂_0 = p − |Ĵ_{d+1}|. As the error rates in estimating J^c_{d+1} have already been reported in Tables 5-6, we concentrate now on the components of y_t with indices in J^c_{d+1} ∩ Ĵ^c_{d+1}, and count the number τ of them which were misplaced by the K-means clustering. Both the means and the standard deviations of the error rates τ/|J^c_{d+1} ∩ Ĵ^c_{d+1}| over the 1000 replications are reported in Tables 7-8, together with the relative frequencies of d̂ = d. Tables 7-8 show clearly that the K-means clustering identifies the latent clusters very accurately, and that the difference in performance due to estimating (r_0, r) is small.
Real data example

We consider the daily returns of the stocks listed in the S&P 500 in 31 December 2014 - 31 December 2019. After removing those which were not traded on every trading day during the period, there are p = 477 stocks which were traded on n = 1259 trading days. Those stocks are divided into 11 industry sectors.

Step 1 is to estimate the numbers of strong factors and cluster-specific weak factors. To this end, we calculate the R̂_j as in (12) with k_0 = 5. It turns out that R̂_1 = 32.53 is much larger than all the others, while the R̂_j for j ≥ 2 are plotted in Figure 1. By (13), r̂_0 = 1 and r̂_0 + r̂ = 4. Note that the estimates for r_0 and r_0 + r are unchanged with k_0 = 1, ..., 4.

While the existence of r̂_0 = 1 strong common factor is reasonable, it is most unlikely that there are merely r̂ = 3 cluster-specific weak factors. Note that the estimators in (13) are derived under the assumption that all the r cluster-specific (i.e. weak) factors are of the same strength; see Remark 1(i) in Section 2 above. In practice, weak factors may have different degrees of strength, implying that we should also take into account the 3rd, the 4th and the 5th largest local maxima of R̂_j. Hence we take r̂_0 + r̂ = 16 (or perhaps also 10 or 13), as suggested by Figure 1.

To present the clustering results, we form an 11 × d̂ matrix with n_{ij}/n_i as its (i, j)-th element, where n_i is the number of the stocks in the i-th industry sector, and n_{ij} is the number of the stocks in the i-th industry sector which are allocated to the j-th cluster.


Table 8: The means and standard deviations (STD) of the error rates τ/|J^c_{d+1} ∩ Ĵ^c_{d+1}| and the relative frequencies of d̂ = d in a simulation for Scenario II with 1000 replications, with the numbers of factors r_0 and r either known or estimated.
The heat-maps of this 11 × d̂ matrix for d̂ = d̄ = 9 are presented in Figure 3. The first cluster mainly consists of the companies in Consumer Staples and Utilities; Clusters 2 and 3 contain the companies in, respectively, Health Care and Financials; Cluster 4 contains mainly some companies in Communication Services and Information Technology; Cluster 5 consists of the companies in Industrials and Materials; Cluster 6 is mainly the companies in Consumer Discretionary; Cluster 7 is mainly the companies in Real Estate; Cluster 8 is mainly the companies from Information Technology; and Cluster 9 contains almost all the companies in Energy and a small number of companies from each of 5 or 6 different sectors. To examine how stable the clustering is, we also include the results for d̂ = 11 and d̂ = 3 in Figure 3. When d̂ is increased from 9 to 11, the original Cluster 1 is divided into the new Clusters 1 and 11, with the former consisting of Consumer Staples and the latter being Utilities. Furthermore, the original Cluster 4 splits into the new Clusters 4 and 10, while the other 7 original clusters are hardly changed. With d̂ = 3, most companies in each of the 11 sectors stay in one cluster; for example, most companies in Financials are always in a separate group.

If we take r̂_0 = 1 and r̂ = 9, then |Ĵ_{d+1}| = 12 and p̂_0 = 477 − 12 = 465. The clustering results with d̂ = 9, 11 and 3 are presented in Figure 4. The first cluster mainly consists of the companies in Consumer Staples, Real Estate and Utilities; Clusters 2 and 3 contain the companies in, respectively, Health Care and Financials; Cluster 4 contains mainly some companies in Communication Services and Information Technology; Cluster 5 consists of the companies in Industrials and Materials; Cluster 6 is mainly the companies in Consumer Discretionary; Cluster 7 is a mixture of a small number of companies from each of 5 or 6 different sectors; Cluster 8 is mainly the companies from Information Technology; and Cluster 9 contains almost all the companies in Energy. To examine how stable the clustering is, we also include the results for d̂ = 11 and d̂ = 3 in Figure 4. When d̂ is increased from 9 to 11, the original Cluster 1 is divided into the new Clusters 1 and 11, with the former consisting of the Consumer Staples and Utilities sectors and the latter being the Real Estate sector. Furthermore, the original Cluster 7 splits into the new Clusters 7 and 10, while the other 7 original clusters are hardly changed. With d̂ = 3, most companies in each of the 11 sectors stay in one cluster.

If we take r̂_0 = 1 and r̂ = 12, the estimated Ĵ_{d+1} is unchanged. The clustering results with d̂ = 9, 11 and 3 are presented in Figure 5. Compared with Figure 4, there are some striking similarities. First, the clusterings with d̂ = 3 are almost identical. For d̂ = 9, the profiles of Clusters 2, ..., 6, 8 and 9 are not significantly changed, while Clusters 1 and 7 in Figure 4 are somehow mixed together in Figure 5. With d̂ = 11, the profiles of Clusters 2-6 and 8-10 in the two figures are about the same, while Clusters 7 and 11 are mixed up across the two figures.

Figure 4: Heat-maps of the distributions of the stocks in each of the 11 industry sectors (corresponding to the 11 rows) over the d̂ clusters (corresponding to the d̂ columns), with d̂ = 9, 11 and 3. The estimated numbers of the common and cluster-specific factors are, respectively, r̂_0 = 1 and r̂ = 9.

7 Further comments

Our approach hinges on the difference in factor strengths, i.e. the common factors are strong with δ = 0, and the cluster-specific factors are weak with δ > 0. If, for example, one of the common factors has the same strength as the cluster-specific factors, the number of the strong factors is then r_0 − 1 and the number of the weak factors is r + 1. In this case, the estimated r̂_0, r̂ and the factor loading spaces will all be wrong. Nevertheless, a common factor has non-zero loadings on most components of y_t; hence those loadings must all be extremely small in order for it to be a weak factor. Therefore its impact on the estimation of the number of clusters and on the misclassification rates is minor. Simulation results in the Supplementary Material support this assertion.

Heterogeneous factor strength. We have assumed two factor strengths: p for the common factors and p^{1−δ} for the cluster-specific factors. If there are s different strengths p^{1−δ_1}, ..., p^{1−δ_s} among the cluster-specific factors, we can search for the s largest local maxima among R̂_1, ..., R̂_{J_0−1} in Step 1. Moreover, Step 2 should be repeated s times to estimate the s factor loading matrices corresponding to the different factor strengths. While the asymptotic results can be extended accordingly, small values such as s ≤ 3 are sufficient for most practical applications.

Estimation of r_j and B_j. When the time series are clustered correctly in Steps 4-5, we obtain the estimator for B_j from B̂ directly. We can also run Step 1 on each cluster to estimate r_j; Theorem 3 ensures the consistency of those estimates. Although Theorem 7 ensures that most of the time series can be clustered correctly, there may be a cluster obtained in Step 5 which is not accurate enough. It remains an open problem to evaluate how the clustering error is propagated into the estimation of r_j and B_j.

Table 3: The means and standard deviations (in parentheses) of ‖ÂÂ^⊤ − AA^⊤‖_F and ‖B̂B̂^⊤ − P_{A⊥B}‖_F in a simulation for Scenario I with 1000 replications, where Â is estimated by the eigenvectors of M̂ in (11) (with k_0 = 1, ..., 5), or by those of Σ̂_y(k)Σ̂_y(k)^⊤ (for k = 0, 1, ..., 5), and B̂ is estimated in the similar manner. Both r_0 and r are assumed to be known.

Table 4: The means and standard deviations (in parentheses) of ‖ÂÂ^⊤ − AA^⊤‖_F and ‖B̂B̂^⊤ − P_{A⊥B}‖_F in a simulation for Scenario II with 1000 replications, where Â is estimated by the eigenvectors of M̂ in (11) (with k_0 = 1, ..., 5), or by those of Σ̂_y(k)Σ̂_y(k)^⊤ (for k = 0, 1, ..., 5), and B̂ is estimated in the similar manner. Both r_0 and r are assumed to be known.

Table 6: The means and standard deviations (in parentheses) of the error rates E_1 = |J^c_{d+1} ∩ Ĵ_{d+1}|/|J^c_{d+1}| and E_2 = |J_{d+1} ∩ Ĵ^c_{d+1}|/|J_{d+1}| in a simulation for Scenario II with 1000 replications, with the 3 possible choices of the threshold ω_p in (16), and the numbers of factors r_0 and r either known or estimated.

Table 7: The means and standard deviations (STD) of the error rates τ/|J^c_{d+1} ∩ Ĵ^c_{d+1}| and the relative frequencies of d̂ = d in a simulation for Scenario I with 1000 replications, with the numbers of factors r_0 and r either known or estimated.