Entry-Wise Eigenvector Analysis and Improved Rates for Topic Modeling on Short Documents

Topic modeling is a widely utilized tool in text analysis. We investigate the optimal rate for estimating a topic model. Specifically, we consider a scenario with $n$ documents, a vocabulary of size $p$, and document lengths of order $N$. When $N\geq c\cdot p$, referred to as the long-document case, the optimal rate is established in the literature as $\sqrt{p/(Nn)}$. However, when $N=o(p)$, referred to as the short-document case, the optimal rate remains unknown. In this paper, we first provide new entry-wise large-deviation bounds for the empirical singular vectors of a topic model. We then apply these bounds to improve the error rate of a spectral algorithm, Topic-SCORE. Finally, by comparing the improved error rate with the minimax lower bound, we conclude that the optimal rate is still $\sqrt{p/(Nn)}$ in the short-document case.


Introduction
In today's world, an immense volume of text data is generated in scientific research and in our daily lives. This includes research publications, news articles, posts on social media, electronic health records, and many more. Among the various statistical text models, the topic model (Hofmann, 1999; Blei et al., 2003) stands out as one of the most widely used. Given a corpus consisting of n documents written on a vocabulary of p words, let X = [X_1, X_2, ..., X_n] ∈ R^{p×n} be the word-document-count matrix, where X_i(j) is the count of the jth word in the ith document, for 1 ≤ i ≤ n and 1 ≤ j ≤ p. Let A_1, A_2, ..., A_K ∈ R^p be probability mass functions (PMFs). We call each A_k a topic vector, which represents a particular distribution over words in the vocabulary. For each 1 ≤ i ≤ n, let N_i denote the length of the ith document, and let w_i ∈ R^K be a weight vector, where w_i(k) is the fractional weight this document puts on the kth topic, for 1 ≤ k ≤ K. In a topic model, the columns of X are independently generated, where the ith column satisfies:
$$X_i \sim \mathrm{Multinomial}\bigl(N_i,\, d_i^0\bigr), \qquad d_i^0 = \sum_{k=1}^K w_i(k)\,A_k = A w_i. \qquad (1)$$
Here d_i^0 ∈ R^p is the population word frequency vector for the ith document, which is a convex combination of the K topic vectors. The N_i words in this document are sampled with replacement from the vocabulary using the probabilities in d_i^0; as a result, the word counts follow a multinomial distribution. Under this model, E[X] is a rank-K matrix. The statistical problem of interest is to use X to estimate the two parameter matrices A = [A_1, A_2, ..., A_K] and W = [w_1, w_2, ..., w_n].
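As a concrete illustration of this data-generating process, the following minimal Python (NumPy) sketch simulates a corpus from model (1). The dimensions and the Dirichlet choices used to draw A and W are illustrative assumptions, not part of the model specification above.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, K, N = 1000, 500, 3, 100            # documents, vocabulary size, topics, document length

# Topic matrix A (p x K): each column is a PMF over the p words.
A = rng.dirichlet(np.full(p, 0.1), size=K).T        # shape (p, K)
# Topic weights W (K x n): each column is a PMF over the K topics.
W = rng.dirichlet(np.full(K, 0.3), size=n).T        # shape (K, n)

# Population word-frequency vectors: d0_i = A w_i (a convex combination of topic vectors).
D0 = A @ W                                          # shape (p, n)

# Word counts: the i-th column of X is Multinomial(N, d0_i).
X = np.column_stack([rng.multinomial(N, D0[:, i]) for i in range(n)])
print(X.shape)   # (p, n); E[X] = N * D0 has rank K
```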
Since the topic model implies a low-rank structure behind the data matrix, spectral algorithms (Ke et al., 2023) have been developed for topic model estimation. Topic-SCORE (Ke and Wang, 2024) is the first spectral algorithm in the literature. It conducts singular value decomposition (SVD) on a properly normalized version of X, then uses the first K left singular vectors to estimate A, and finally uses Â to estimate W by weighted least-squares. Ke and Wang (2024) showed that the error rate on A is $\sqrt{p/(nN)}$ up to a logarithmic factor, where N is the order of the document lengths. It matches the minimax lower bound (Ke and Wang, 2024) when N ≥ c·p for a constant c > 0, referred to as the long-document case. However, there are many application scenarios with N = o(p), referred to as the short-document case. For example, if we consider a corpus consisting of abstracts of academic publications (e.g., see Ke et al. (2023)), N is usually between 100 and 200, but p can be a few thousand or even larger. In this short-document case, Ke and Wang (2024) observed a gap between the minimax lower bound and the error rate of Topic-SCORE. They posed the following questions: Is the optimal rate still $\sqrt{p/(Nn)}$ in the short-document case? If so, can spectral algorithms still achieve this rate?
In this paper, we give answers to these questions. We discovered that the gap between the minimax lower bound and the error rate of Topic-SCORE in the short-document case came from the unsatisfactory entry-wise large-deviation bounds for the empirical singular vectors. While the analysis in Ke and Wang (2024) is effective for long documents, there is considerable room for improvement in the short-document case. We use new analysis to obtain much better large-deviation bounds when N = o(p). Our strategy includes two main components: one is an improved non-stochastic perturbation bound for SVD allowing severe heterogeneity in the population singular vectors, and the other is leveraging a decoupling inequality (de la Pena and Montgomery-Smith, 1995) to control the spectral norm of a random matrix with centered multinomial-distributed columns. These new ideas allow us to obtain satisfactory entry-wise large-deviation bounds for the empirical singular vectors across the entire regime of N ≥ log³(n). As a consequence, we are able to significantly improve the error rate of Topic-SCORE in the short-document case. This answers the two questions posed by Ke and Wang (2024): the optimal rate is still $\sqrt{p/(Nn)}$ in the short-document case, and Topic-SCORE still achieves this optimal rate.
Additionally, inspired by our analysis, we have made a modification to Topic-SCORE to better incorporate document lengths. We also extend the asymptotic setting in Ke and Wang (2024) to a weak-signal regime that allows the K topic vectors to be extremely similar to each other.

Related Literature
Many topic modeling algorithms have been proposed in the literature, such as LDA (Blei et al., 2003), the separable NMF approach (Arora et al., 2012, 2013), the method in Bansal et al. (2014) that uses a low-rank approximation to the original data matrix, Topic-SCORE (Ke and Wang, 2024), and LOVE (Bing et al., 2020). Theoretical guarantees were derived for these methods, but unfortunately, most of them had non-optimal rates even when N ≥ c·p. Topic-SCORE and LOVE are the two that achieve the optimal rate when N ≥ c·p. However, LOVE has no theoretical guarantee when N = o(p); Topic-SCORE has a theoretical guarantee across the entire regime, but the rate obtained by Ke and Wang (2024) is non-optimal when N = o(p). Therefore, our results address a critical gap in the existing literature by determining the optimal rate for the short-document case for the first time.
Entry-wise eigenvector analysis (Erdős et al., 2013; Fan et al., 2018, 2022; Abbe et al., 2020; Chen et al., 2021; Ke and Wang, 2022) provides large-deviation bounds or higher-order expansions for individual entries of the leading eigenvectors of a random matrix. There are two types of random matrices, i.e., the Wigner type (e.g., in network data and pairwise comparison data) and the Wishart type (e.g., in factor models and spiked covariance models (Paul, 2007)). The random matrices in topic models are of the Wishart type, and hence, techniques for the Wigner type, such as the leave-one-out approach (Ke and Wang, 2022), are not a good fit. We cannot easily extend the techniques (Fan et al., 2018; Chen et al., 2021) for spiked covariance models either. One reason is that the multinomial distribution has heavier-than-Gaussian tails (especially for short documents), and using the existing techniques only gives non-sharp bounds. Another reason is the severe word frequency heterogeneity (Zipf, 2013) in natural languages, which calls for bounds whose orders are different for different entries of an eigenvector. Our analysis overcomes these challenges.

Organization and Notations
The rest of this paper is organized as follows. Section 2 presents our main results about entry-wise eigenvector analysis for topic models. Section 3 applies these results to obtain improved error bounds for the Topic-SCORE algorithm and to determine the optimal rate in the short-document case. Section 4 describes the main technical components, along with a proof sketch. Section 5 concludes the paper with discussions. The proofs of all theorems are relegated to Appendices A-E.
Throughout this paper, for a matrix B, let B(i, j) or B_ij represent the (i, j)-th entry. We denote by ∥B∥ its operator norm and by ∥B∥_{2→∞} the 2-to-∞ norm, which is the maximum ℓ₂ norm across all rows of B. For a vector b, b(i) or b_i represents the i-th component. We denote by ∥b∥₁ and ∥b∥ the ℓ₁ and ℓ₂ norms of b, respectively. The vector 1_n stands for the all-one vector of dimension n. Unless specified otherwise, {e_1, e_2, ..., e_p} denotes the standard basis of R^p. Furthermore, we write a_n ≫ b_n or b_n ≪ a_n if b_n/a_n = o(1) for a_n, b_n > 0; and we denote a_n ≍ b_n if C^{-1} b_n < a_n < C b_n for some constant C > 1.
2 Entry-Wise Eigenvector Analysis for Topic Models

Let X ∈ R^{p×n} be the word-count matrix following the topic model in (1). We introduce the empirical frequency matrix D = [d_1, d_2, ..., d_n] ∈ R^{p×n}, defined by d_i = N_i^{-1} X_i for 1 ≤ i ≤ n. Under the model in (1), we have E[d_i] = d_i^0 = A w_i, so that D_0 := E[D] = AW. We observe that D_0 is a rank-K matrix; furthermore, the linear space spanned by the first K left singular vectors of D_0 is the same as the column space of A. Ke and Wang (2024) discovered that there is a low-dimensional simplex structure that explicitly connects the first K left singular vectors of D_0 with the target topic matrix A. This inspired SVD-based methods for estimating A. However, if one directly conducts SVD on D, the empirical singular vectors can be noisy because of severe word frequency heterogeneity in natural languages (Zipf, 2013). In what follows, we first introduce a normalization of D in Section 2.1 to handle word frequency heterogeneity and then derive entry-wise large-deviation bounds for the empirical singular vectors in Section 2.2.

A Normalized Data Matrix
We first explain why it is inappropriate to conduct SVD on D.
The singular vectors of D are the same as the eigenvectors of DD′. Write D = D_0 + Z. Under the model (1), the columns of Z are centered multinomial-distributed random vectors; moreover, using the covariance matrix formula for multinomial distributions, we have
$$\mathbb{E}[DD'] \;=\; D_0D_0' + \sum_{i=1}^n \mathrm{Cov}(d_i) \;=\; n\,A\Sigma_W A' \;+\; \frac{n}{N}\,M_0 \;-\; \sum_{i=1}^n N_i^{-1}\, d_i^0 (d_i^0)'. \qquad (3)$$
Here AΣ_W A′ is a rank-K matrix whose eigen-space is the same as the column span of A. However, because of the diagonal matrix M_0, the eigen-space of E[DD′] is no longer the same as the column span of A. We notice that the jth diagonal entry of M_0 captures the overall frequency of the jth word across the whole corpus. Hence, this is an issue caused by word frequency heterogeneity. The second term in (3) is larger when N is smaller. This implies that the issue becomes more severe for short documents.
To resolve this issue, we consider a normalization of D to M_0^{-1/2}D. It follows that
$$\mathbb{E}\bigl[M_0^{-1/2}DD'M_0^{-1/2}\bigr] \;=\; n\,M_0^{-1/2}A\Sigma_W A'M_0^{-1/2} \;+\; \frac{n}{N}\,I_p \;-\; \sum_{i=1}^n N_i^{-1}\,M_0^{-1/2} d_i^0 (d_i^0)' M_0^{-1/2}. \qquad (4)$$
Now, the second term is proportional to an identity matrix and no longer affects the eigen-space. Furthermore, the eigen-space of the first term is the column span of M_0^{-1/2}A, and hence, we can use the eigenvectors to recover M_0^{-1/2}A (then A is immediately known). In practice, M_0 is not observed, so we replace it by its empirical version M̃, obtained by substituting the observed frequency vectors d_i for the unobserved d_i^0 in the definition of M_0. We propose to normalize D to M̃^{-1/2}D before conducting SVD. Later, the singular vectors of M̃^{-1/2}D will be used in Topic-SCORE to estimate A (see Section 3). This normalization is similar to the pre-SVD normalization in Ke and Wang (2024) but not exactly the same. Inspired by analyzing a special case where N_i = N, Ke and Wang (2024) proposed to normalize D to M^{-1/2}D, where M = diag(n^{-1} Σ_{i=1}^n d_i). They continued using M in general settings, but we discover here that the adjustment of M to M̃ is necessary when the N_i's are unequal.
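For illustration, a minimal sketch of this pre-SVD normalization is given below (Python/NumPy). It treats the equal-document-length case, in which the adjustment from M to M̃ is not needed; the function name and interface are ours, for illustration only.

```python
import numpy as np

def normalized_svd(X, N_i, K):
    """SVD of the normalized frequency matrix M^{-1/2} D.

    Sketch for the case of (roughly) equal document lengths, in which the
    adjusted matrix M-tilde coincides with M = diag(n^{-1} sum_i d_i).
    Assumes rare words have already been merged (see Remark 1), so M has no
    zero diagonal entries.
    """
    D = X / np.asarray(N_i)                 # d_i = X_i / N_i, empirical word frequencies
    M_diag = D.mean(axis=1)                 # diagonal of M
    Dn = D / np.sqrt(M_diag)[:, None]       # M^{-1/2} D
    U, S, Vt = np.linalg.svd(Dn, full_matrices=False)
    return U[:, :K], S[:K], Vt[:K].T        # first K left/right singular vectors and values
```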
Remark 1 For extremely low-frequency words, the corresponding diagonal entries of M̃ are very small. This causes an issue when we normalize D to M̃^{-1/2}D. Fortunately, this issue disappears if we pre-process the data. As a standard pre-processing step for topic modeling, we either remove those extremely low-frequency words or combine all of them into a single "meta-word". We recommend the latter approach. In detail, let L ⊂ {1, 2, ..., p} be the set of words such that M̃(j, j) is below a proper threshold t_n (e.g., t_n can be 0.05 times the average of the diagonal entries of M̃). We then sum up all rows of D with indices in L into a single row. Let D* ∈ R^{(p−|L|+1)×n} be the processed data matrix. The matrix D* still has a topic model structure, where each new topic vector results from a similar row combination on the corresponding original topic vector.
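The pre-processing step of Remark 1 can be sketched as follows; the threshold rule (0.05 times the average frequency) mirrors the example in the remark and is a tuning choice, and the function name is ours.

```python
import numpy as np

def merge_rare_words(X, N_i, frac=0.05):
    """Combine extremely low-frequency words into one 'meta-word' row (Remark 1).

    Words whose average empirical frequency is below `frac` times the overall
    average are summed into a single appended row. Returns the processed count
    matrix and the indices of the retained (non-rare) words.
    """
    D = X / np.asarray(N_i)
    M_diag = D.mean(axis=1)
    rare = M_diag < frac * M_diag.mean()
    if not rare.any():
        return X, np.arange(X.shape[0])
    X_star = np.vstack([X[~rare], X[rare].sum(axis=0, keepdims=True)])
    return X_star, np.flatnonzero(~rare)
```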

Remark 2
The normalization of D to M̃^{-1/2}D is reminiscent of the Laplacian normalization in network data analysis, but the motivation is very different. In many network models, the adjacency matrix satisfies B = B_0 + Y, where B_0 is a low-rank matrix and Y is a generalized Wigner matrix. Since E[Y] is a zero matrix, the eigen-space of E[B] is the same as that of B_0. Hence, the role of the Laplacian normalization is not to correct the eigen-space but to adjust the signal-to-noise ratio (Ke and Wang, 2022). In contrast, our normalization here aims to turn E[ZZ′] into an identity matrix (plus a small matrix that can be absorbed into the low-rank part). We need such a normalization even under moderate word frequency heterogeneity (i.e., when the frequencies of all words are of the same order).
2.2 Entry-Wise Singular Vector Analysis for M̃^{-1/2}D

For each 1 ≤ k ≤ K, let ξ̂_k ∈ R^p denote the kth left singular vector of M̃^{-1/2}D. Recall that D_0 = E[D]. However, the singular vectors of M_0^{-1/2}D_0 are not the population counterparts of the ξ̂_k's. In light of (4), we instead define ξ_1, ξ_2, ..., ξ_K ∈ R^p as the first K eigenvectors of the first (rank-K) term in (4), and we write Ξ̂ = [ξ̂_1, ..., ξ̂_K] and Ξ = [ξ_1, ..., ξ_K]. We aim to derive a large-deviation bound for each individual row of (Ξ̂ − Ξ), subject to a column rotation of Ξ. We need a few assumptions. Let h_j = Σ_{k=1}^K A_k(j) for 1 ≤ j ≤ p. Define the two K × K matrices Σ_A and Σ_W, called the topic-topic overlapping matrix and the topic-topic concurrence matrix, respectively (Ke and Wang, 2024). It is easy to see that Σ_W is properly scaled. We remark that Σ_A is also properly scaled, because its largest eigenvalue is of constant order.

Assumption 1 Let h_max = max_{1≤j≤p} h_j, h_min = min_{1≤j≤p} h_j, and h̄ = p^{-1}Σ_{j=1}^p h_j. We assume h_min ≥ c_1 h̄ for a constant c_1 ∈ (0, 1).

Assumption 2 For a constant c_2 ∈ (0, 1) and a sequence β_n ∈ (0, 1), we assume:

Assumption 1 is related to word frequency heterogeneity. Each h_j captures the overall frequency of word j, and h̄ = p^{-1}Σ_j h_j = p^{-1}Σ_k ∥A_k∥_1 = K/p. By Remark 1, all extremely low-frequency words have been combined in pre-processing. It is therefore reasonable to assume that h_min is of the same order as h̄. Meanwhile, we put no restrictions here on h_max, so the h_j's can still be of different orders.
Assumption 2 is about topic weight balance and between-topic similarity. Σ_W can be regarded as an affinity matrix of the w_i's. It is mild to assume that Σ_W is well-conditioned.
In a special case where N_i = N and each w_i is degenerate, Σ_W is a diagonal matrix whose kth diagonal entry is the fraction of documents that put all their weight on topic k; hence, λ_min(Σ_W) ≥ c_2 is interpreted as "topic weight balance". Regarding Σ_A, we have seen that it is properly scaled (its maximum eigenvalue is of constant order). When the K topic vectors are exactly the same, λ_min(Σ_A) = 0; when the topic vectors are not all the same, λ_min(Σ_A) ≠ 0, and it measures the signal strength. Ke and Wang (2024) assumed that λ_min(Σ_A) is bounded below by a constant, but we allow weaker signals by allowing λ_min(Σ_A) to diminish as n → ∞. We also require a lower bound on Σ_A(k, ℓ), meaning that there should be a certain overlap between any two topics. This is reasonable, as some commonly used words are not exclusive to any one topic and tend to occur frequently (Ke and Wang, 2024).
The last assumption is about the vocabulary size and document lengths.
Assumption 3 There exists N ≥ 1 and a constant c_3 ∈ (0, 1) such that c_3 N ≤ N_i ≤ c_3^{-1} N for all 1 ≤ i ≤ n. In addition, for an arbitrary constant C_0 > 0:
$$\log^3(n) \;\le\; \min\{N,\, p\}, \qquad \max\{N,\, p\} \;\le\; n^{C_0}, \qquad p\log^2(n) \;\le\; Nn\beta_n^2.$$

In Assumption 3, the first two inequalities restrict N and p to lie between log³(n) and n^{C_0}, for an arbitrary constant C_0 > 0. This covers a wide regime, including the scenarios of both long documents (N ≥ c·p) and short documents (N = o(p)). The third inequality is needed so that the canonical angles between the empirical and population singular spaces converge to zero, which is necessary for our singular vector analysis. This condition is mild, as Nn is the order of the total word count in the corpus, which is often much larger than p.
With these assumptions, we now present our main theorem.
Theorem 1 (Entry-wise singular vector analysis) Fix K ≥ 2 and positive constants c_1, c_2, c_3, and C_0. Under the model (1), suppose Assumptions 1-3 hold. For any constant C_1 > 0, there exists C_2 > 0 such that, with probability 1 − n^{−C_1}, there is an orthogonal matrix O ∈ R^{K×K} satisfying, simultaneously for 1 ≤ j ≤ p:

The constant C_2 only depends on C_1 and (K, c_1, c_2, c_3, C_0).
In Theorem 1, we do not assume any gap among the K singular values of M_0^{-1/2}D_0; hence, it is only possible to recover Ξ up to a column rotation O. The sin-theta theorem (Davis and Kahan, 1970) provides a bound on the aggregate estimation error of Ξ̂, but it is insufficient for analyzing spectral algorithms for topic modeling (see Section 3). We need a bound for each individual row of (Ξ̂ − ΞO′), and this bound should depend on h_j properly.
We compare Theorem 1 with the result in Ke and Wang (2024). They assumed that β_n^{-1} = O(1), so their results are only for the strong-signal regime. They showed that, when n is sufficiently large:

When N ≥ c·p (long documents), this is the same bound as in Theorem 1 (with β_n = 1). However, when N = o(p) (short documents), it is strictly worse than Theorem 1. We obtain better bounds than those in Ke and Wang (2024) because of new proof ideas, especially the use of a refined perturbation analysis for SVD and a decoupling technique for U-statistics (see Section 4.2).

Improved Rates for Topic Modeling
We apply the results in Section 2 to improve the error rates of topic modeling. Topic-SCORE (Ke and Wang, 2024) is a spectral algorithm for estimating the topic matrix A. It achieves the optimal rate in the long-document case (N ≥ c·p). However, in the short-document case (N = o(p)), the known rate of Topic-SCORE does not match the minimax lower bound. We address this gap by providing better error bounds for Topic-SCORE. Our results reveal the optimal rate for topic modeling in the short-document case for the first time.

The Topic-SCORE Algorithm
Let ξ̂_1, ξ̂_2, ..., ξ̂_K be as in Section 2. Topic-SCORE first obtains word embeddings from these singular vectors. Note that M̃^{-1/2}D is a non-negative matrix. By Perron's theorem (Horn and Johnson, 1985), under mild conditions, ξ̂_1 is a strictly positive vector. Define R̂ ∈ R^{p×(K−1)} by
$$\hat R(j, k) = \hat\xi_{k+1}(j)/\hat\xi_1(j), \qquad 1 \le j \le p,\; 1 \le k \le K-1. \qquad (10)$$
Let r̂′_1, r̂′_2, ..., r̂′_p denote the rows of R̂. Then, r̂_j is a (K − 1)-dimensional embedding of the jth word in the vocabulary. This is known as the SCORE embedding (Jin, 2015; Ke and Jin, 2023), which is now widely used in analyzing heterogeneous network and text data.
Ke and Wang (2024) discovered that there is a simplex structure associated with these word embeddings. Specifically, let ξ_1, ξ_2, ..., ξ_K be the same as in (7) and define the population counterpart of R̂ as R, where R(j, k) = ξ_{k+1}(j)/ξ_1(j) for 1 ≤ j ≤ p and 1 ≤ k ≤ K − 1. Let r′_1, r′_2, ..., r′_p denote the rows of R. All these r_j are contained in a simplex S ⊂ R^{K−1} that has K vertices v_1, v_2, ..., v_K (see Figure 1). If the jth word is an anchor word (Arora et al., 2012; Donoho and Stodden, 2004) (an anchor word of topic k satisfies A_k(j) ≠ 0 and A_ℓ(j) = 0 for all other ℓ ≠ k), then r_j is located at one of the vertices. Therefore, as long as each topic has at least one anchor word, we can apply a vertex hunting (Ke and Wang, 2024) algorithm to recover the K vertices of S. By the definition of a simplex, each point inside S can be written uniquely as a convex combination of the K vertices, and the K-dimensional vector consisting of the convex combination coefficients is called the barycentric coordinate. After recovering the vertices of S, we can easily compute the barycentric coordinate π_j ∈ R^K for each r_j. Write Π = [π_1, π_2, ..., π_p]′. Ke and Wang (2024) showed that each A_k is proportional to the kth column of M_0^{1/2} diag(ξ_1) Π. Therefore, we can recover A_k by taking the kth column of M_0^{1/2} diag(ξ_1) Π and re-normalizing it to have a unit ℓ_1-norm. This gives the main idea behind Topic-SCORE (see Figure 1).
Figure 1: An illustration of Topic-SCORE in the noiseless case (K = 3). The blue dots are r_j ∈ R^{K−1} (word embeddings constructed from the population singular vectors), and they are contained in a simplex with K vertices. This simplex can be recovered by a vertex hunting algorithm. Given this simplex, each r_j has a unique barycentric coordinate π_j ∈ R^K. The topic matrix A is recovered by stacking together these π_j's and utilizing M_0 and ξ_1.
The full algorithm is given in Algorithm 1. It requires plugging in a vertex hunting (VH) algorithm. A VH algorithm aims to estimate v_1, v_2, ..., v_K from the noisy point cloud {r̂_j}_{1≤j≤p}. There are many existing VH algorithms (see Section 3.4 of Ke and Jin (2023)). A VH algorithm is said to be efficient if it satisfies max_{1≤k≤K} ∥v̂_k − v_k∥ ≤ C max_{1≤j≤p} ∥r̂_j − r_j∥ (subject to a permutation of v̂_1, v̂_2, ..., v̂_K). We always plug in an efficient VH algorithm, such as the successive projection algorithm (Araújo et al., 2001), the pp-SPA algorithm (Jin et al., 2024), or one of several algorithms in Section 3.4 of Ke and Jin (2023); a sketch of the successive projection algorithm is given below.
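For concreteness, here is a compact sketch of a successive-projection-style vertex hunting step in Python. The appended constant coordinate is an implementation convenience for handling the affine simplex, not a detail taken from the cited papers, and the function name is ours.

```python
import numpy as np

def successive_projection(R, K):
    """A compact successive-projection-style vertex hunting step.

    Repeatedly pick the point of largest norm and project all points onto the
    orthogonal complement of the chosen direction. A constant coordinate is
    appended so that points in an affine simplex are handled correctly.
    """
    Y = np.hstack([R, np.ones((R.shape[0], 1))]).astype(float)
    residual = Y.copy()
    picked = []
    for _ in range(K):
        j = int(np.argmax(np.linalg.norm(residual, axis=1)))
        picked.append(j)
        u = residual[j] / (np.linalg.norm(residual[j]) + 1e-12)
        residual = residual - np.outer(residual @ u, u)   # project out the new direction
    return R[picked]                                      # K x (K-1) estimated vertices
```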
Algorithm 1 (Topic-SCORE). Input: D, K, and a vertex hunting (VH) algorithm.
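Since the individual steps of Algorithm 1 are not reproduced in full here, the following Python sketch assembles the pipeline as described above (normalize, SVD, SCORE embedding, vertex hunting, barycentric coordinates, recover A). It assumes roughly equal document lengths (so M̃ reduces to M) and a user-supplied VH routine; it is our reading of the description in the text, not the authors' implementation.

```python
import numpy as np

def topic_score(X, N_i, K, vertex_hunting):
    """Sketch of the Topic-SCORE pipeline described in the text.

    `vertex_hunting(R, K)` must return K estimated simplex vertices, e.g. the
    successive projection sketch above. Equal document lengths are assumed, so
    the normalization matrix is M = diag(n^{-1} sum_i d_i).
    """
    p, n = X.shape
    D = X / np.asarray(N_i)                       # empirical word frequencies
    M_diag = D.mean(axis=1)                       # diagonal of M (positive after Remark 1)
    U, _, _ = np.linalg.svd(D / np.sqrt(M_diag)[:, None], full_matrices=False)
    Xi = U[:, :K].copy()                          # first K left singular vectors
    if Xi[:, 0].sum() < 0:                        # fix the sign of the Perron vector
        Xi[:, 0] *= -1
    R = Xi[:, 1:] / Xi[:, [0]]                    # SCORE embeddings, p x (K-1)
    V = vertex_hunting(R, K)                      # K x (K-1) simplex vertices
    # Barycentric coordinates: solve [V' ; 1'] pi_j = [r_j ; 1] for every word j.
    T = np.vstack([V.T, np.ones(K)])              # K x K
    Pi = np.linalg.solve(T, np.vstack([R.T, np.ones(p)])).T
    Pi = np.clip(Pi, 0.0, None)                   # truncate small negative coordinates
    A_hat = np.sqrt(M_diag)[:, None] * Xi[:, [0]] * Pi    # M^{1/2} diag(xi_1) Pi
    return A_hat / A_hat.sum(axis=0, keepdims=True)       # re-normalize columns to PMFs
```

For example, with the simulated corpus above, `A_hat = topic_score(X, N, K, successive_projection)` returns a p × K estimate whose columns can be compared with A after a column permutation.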
Additionally, after Â is obtained, Ke and Wang (2024) suggested estimating w_1, w_2, ..., w_n as follows. We first run a weighted least-squares regression to obtain ŵ*_i:
$$\hat w_i^* = \underset{w\in\mathbb{R}^K}{\mathrm{argmin}}\ (d_i - \hat A w)'\,\tilde M^{-1}\,(d_i - \hat A w). \qquad (12)$$
Then, we set all the negative entries of ŵ*_i to zero and re-normalize the vector to have a unit ℓ_1-norm. The resulting vector is ŵ_i.
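A corresponding sketch of this weighted least-squares step, followed by truncation and ℓ_1 re-normalization, is given below; again the equal-document-length case is assumed so that M̃ is simply M, and the function name is ours.

```python
import numpy as np

def estimate_W(X, N_i, A_hat):
    """Weighted least-squares estimate of the topic weights (cf. (12)).

    Regress each d_i on A_hat with weight matrix M^{-1}, then truncate
    negative entries and re-normalize each column to a unit l1-norm.
    """
    D = X / np.asarray(N_i)
    w = 1.0 / D.mean(axis=1)                                  # diagonal of M^{-1}
    G = A_hat.T @ (w[:, None] * A_hat)                        # A' M^{-1} A  (K x K)
    W_star = np.linalg.solve(G, A_hat.T @ (w[:, None] * D))   # explicit WLS solution, K x n
    W_hat = np.clip(W_star, 0.0, None)
    return W_hat / W_hat.sum(axis=0, keepdims=True)
```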
Remark 3 In real-world applications, both n and p can be very large. However, since R̂ is constructed from only a few singular vectors, its rows are only of dimension (K − 1). This means Topic-SCORE leverages a 'low-dimensional' simplex structure and is scalable to large datasets. When K is bounded, the complexity of Topic-SCORE is at most O(np·min{n, p}) (Ke and Wang, 2024). The actual computing time was also reported in Ke and Wang (2024) for various values of (n, p). For example, when both n and p are a few thousand, it takes only a few seconds to run Topic-SCORE.

The Improved Rates for Estimating A and W
We provide the error rates of Topic-SCORE. First, we study the word embeddings r̂_j. By (10), r̂_j is constructed from the jth row of Ξ̂. Therefore, we can apply Theorem 1 to derive a large-deviation bound for r̂_j.
Without loss of generality, we set C_1 = 4 henceforth, so that the event probability 1 − n^{−C_1} in Theorem 1 becomes 1 − o(n^{−3}). We also use C to denote a generic constant, whose value may change from one occurrence to another. In all instances, C depends solely on K and the constants (c_1, c_2, c_3, C_0) in Assumptions 1-3.
Theorem 2 (Word embeddings) Under the setting of Theorem 1, with probability 1 − o(n^{−3}), there exists an orthogonal matrix Ω ∈ R^{(K−1)×(K−1)} such that, simultaneously for 1 ≤ j ≤ p:

Next, we study the error of Â. The ℓ_1-estimation error is L(Â, A) := Σ_{k=1}^K ∥Â_k − A_k∥_1, subject to an arbitrary column permutation of Â. For ease of notation, we do not explicitly denote this permutation in theorem statements, but it is accounted for in the proofs. For each 1 ≤ j ≤ p, let â′_j ∈ R^K and a′_j ∈ R^K denote the jth rows of Â and A, respectively. We can re-write the ℓ_1-estimation error as L(Â, A) = Σ_{j=1}^p ∥â_j − a_j∥_1. The next theorem provides an error bound for each individual â_j, and the aggregation of these bounds yields an overall bound for L(Â, A).

Theorem 3 (Estimation of A) Under the setting of Theorem 1, we additionally assume that each topic has at least one anchor word. With probability 1 − o(n^{−3}), simultaneously for 1 ≤ j ≤ p:

Furthermore, with probability 1 − o(n^{−3}), the ℓ_1-estimation error satisfies:

Theorem 3 improves the result in Ke and Wang (2024) in two aspects. First, Ke and Wang (2024) assumed β_n^{-1} = O(1), so their results did not allow for weak signals. Second, even when β_n^{-1} = O(1), their bound is worse than ours by a factor similar to that in (9). Finally, we have the error bound for estimating the w_i's using the estimator in (12).
Theorem 4 (Estimation of W) Under the setting of Theorem 3, with probability 1 − o(n^{−3}), subject to a column permutation of Ŵ:

In Theorem 4, there are two terms in the error bound for ŵ_i. The first term comes from the estimation error in Â, and the second term is from the noise in d_i. In the strong-signal case of β_n^{-1} = O(1), we can compare Theorem 4 with the bound for ŵ_i in Ke and Wang (2024). The bound there also has two terms: its second term is similar to ours, but its first term is strictly worse.

Connections and Comparisons
There have been numerous results about the error rates of estimating A and W. For example, Arora et al. (2012) provided the first explicit theoretical guarantees for topic modeling, but they did not study the statistical optimality of their method. Recently, the statistical literature has aimed to understand the fundamental limits of topic modeling. Under suitable regularity conditions, refs. (Ke and Wang, 2024; Bing et al., 2020) gave a minimax lower bound, $\sqrt{p/(Nn)}$, for the rate of estimating A, and refs. (Wu et al., 2023; Klopp et al., 2023) gave a minimax lower bound, $1/\sqrt{N}$, for estimating each w_i. For estimating A, when β_n^{-1} = O(1), the existing theoretical results are summarized in Table 1. Ours is the only one that matches the minimax lower bound across the entire regime. In the long-document case (N ≥ c·p, Cases 1-2 in Table 1), the error rates in Ke and Wang (2024) and Bing et al. (2020) together have matched the lower bound, concluding that $\sqrt{p/(Nn)}$ is indeed the optimal rate. However, in the short-document case (N = o(p), Case 3 in Table 1), there was a gap between the lower bound and the existing error rates. Our result closes this gap and concludes that $\sqrt{p/(Nn)}$ is still the optimal rate. When β_n = o(1), the error rates of estimating A were rarely studied. We conjecture that $\sqrt{p/(Nn\beta_n^2)}$ is the optimal rate, and that the Topic-SCORE algorithm is still rate-optimal. We emphasize that our rate is not affected by severe word frequency heterogeneity. As long as h_min/h̄ is bounded below by a constant (see Assumption 1 and the explanations therein), our rate is always the same, regardless of the magnitude of h_max. In contrast, the error rate in Bing et al. (2020) is sensitive to word frequency heterogeneity, with an extra factor of h_max/h_min that can be as large as p. There are two reasons that enable us to achieve a flat rate even under severe word frequency heterogeneity: one is the proper normalization of the data matrix, as described in Section 2.1, and the other is the careful analysis of the empirical singular vectors (see Section 4).
For estimating W, when β_n^{-1} = O(1), our error rate in Theorem 4 matches the minimax lower bound if n ≥ p log(n). Our approach to estimating W involves first obtaining Â and then regressing d_i on Â to derive ŵ_i. The condition n ≥ p log(n) ensures that the estimation error in Â does not dominate the overall error. This condition is often met in scenarios where a large number of documents can be collected but the vocabulary size remains relatively stable. However, if n < p log(n), a different approach is necessary, requiring the estimation of W first. This involves using the right singular vectors of M̃^{-1/2}D. While our analysis has primarily focused on the left singular vectors, it can be extended to study the right singular vectors as well.

Table 1: A summary of the existing theoretical results for estimating A (n is the number of documents, p is the vocabulary size, N is the order of document lengths, and h_max and h_min are the same as in (8)). Cases 1-3 refer to N ≥ p^{4/3}, p ≤ N < p^{4/3}, and N < p, respectively. For Cases 2-3, the sub-cases 'a' and 'b' correspond to n ≥ max{N p^2, p^3, N^2 p^5} and n < max{N p^2, p^3, N^2 p^5}, respectively. We have translated the results in each paper into bounds on L(Â, A), with any logarithmic factor omitted. [Table columns: Case 1, Case 2a, Case 2b, Case 3a, Case 3b; rows include Ke and Wang (2024) and Arora et al. (2012).]

Proof Ideas
Our main result is Theorem 1, which provides entry-wise large-deviation bounds for the singular vectors of M̃^{-1/2}D. Given this theorem, the proofs of Theorems 2-4 are similar to those in Ke and Wang (2024) and are thus relegated to the appendix. In this section, we focus on discussing the proof techniques behind Theorem 1.

Why the Leave-One-Out Technique Fails
Leave-one-out (Abbe et al., 2020; Ke and Wang, 2022) is a common technique in entry-wise eigenvector analysis for a Wigner-type random matrix B = B_0 + Y, where B_0 is a symmetric non-stochastic low-rank matrix and Y is a symmetric random matrix whose upper triangle consists of independent mean-zero variables. One example of such matrices is the adjacency matrix of a random graph generated from the block-model family (Jin, 2015). However, our target here is the singular vectors of M̃^{-1/2}D, which are the eigenvectors of B := M̃^{-1/2}DD′M̃^{-1/2}. This is a Wishart-type random matrix, whose upper triangular entries are not independent. We may also construct the symmetric matrix
$$\mathcal G := \begin{pmatrix} 0 & \tilde M^{-1/2}D \\ D'\tilde M^{-1/2} & 0 \end{pmatrix} \in \mathbb{R}^{(p+n)\times(p+n)}.$$
The eigenvectors of G take the form û_k = (ξ̂′_k, η̂′_k)′, 1 ≤ k ≤ K, where ξ̂_k ∈ R^p and η̂_k ∈ R^n are the kth left and right singular vectors of M̃^{-1/2}D, respectively. Unfortunately, the upper triangle of G still contains dependent entries. Some of the dependence comes from the normalization matrix M̃. This may be addressed by using the techniques developed by Ke and Wang (2022) for studying graph Laplacian matrices. A more severe issue is the dependence among the entries of D. According to basic properties of multinomial distributions, D has column independence but no row independence. As a result, even after we replace M̃ by M_0, the jth row and column of G are still dependent on the remaining ones, for each 1 ≤ j ≤ p. In conclusion, we cannot apply the leave-one-out technique to either B or G.
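The following small numerical illustration (not from the paper) demonstrates the symmetric-dilation fact used above: the top block of a leading eigenvector of the dilated matrix is, up to sign, a leading left singular vector of the original matrix. A generic rank-K matrix stands in for M̃^{-1/2}D.

```python
import numpy as np

rng = np.random.default_rng(1)
p, n, K = 40, 60, 3
B = rng.standard_normal((p, K)) @ rng.standard_normal((K, n))   # rank-K stand-in for M^{-1/2} D

# Symmetric dilation: its nonzero eigenvalues are +/- the singular values of B,
# and the top/bottom blocks of its eigenvectors are left/right singular vectors of B.
G = np.block([[np.zeros((p, p)), B], [B.T, np.zeros((n, n))]])
evals, evecs = np.linalg.eigh(G)
top = evecs[:p, np.argmax(np.abs(evals))]       # top block of the leading eigenvector

U, S, Vt = np.linalg.svd(B, full_matrices=False)
# Up to sign, top / ||top|| equals the first left singular vector of B.
print(abs(np.dot(top / np.linalg.norm(top), U[:, 0])))   # approximately 1.0
```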

The Proof Structure in Ke and Wang (2024) and Why It Is Not Sharp for Short Documents
Our entry-wise eigenvector analysis primarily follows the proof structure in Ke and Wang (2024). Recall that ξ̂_k ∈ R^p is the kth left singular vector of M̃^{-1/2}D. Define:

Since the identity matrix in G does not affect the eigenvectors, ξ̂_k is the kth eigenvector of G. Additionally, it follows from (7) and (4) that ξ_k is the kth eigenvector of G_0. By (4):

The entry-wise eigenvector analysis in Ke and Wang (2024) has two steps.

• Step 1: Non-stochastic perturbation analysis. In this step, no distributional assumptions are made on G. The analysis solely focuses on connecting the perturbation from Ξ to Ξ̂ with the perturbation from G_0 to G. They showed in Lemma F.1 (Ke and Wang, 2024):

• Step 2: Large-deviation analysis of G − G_0. In this step, Ke and Wang (2024) derived large-deviation bounds for ∥G − G_0∥ and ∥e′_j(G − G_0)∥ under the multinomial model (1). For example, they showed in Lemma F.5 (Ke and Wang, 2024) that, when n is properly large, with high probability:

However, when N = o(p) (short documents), neither step is sharp. In (15), the second term ∥e′_j(G − G_0)∥ was introduced as an upper bound for ∥e′_j(G − G_0)Ξ̂∥, but this bound is too crude. In Section 4.3, we conduct a careful analysis of ∥e′_j(G − G_0)Ξ̂∥ and introduce a new perturbation bound that significantly improves (15). In (16), the spectral norm is controlled via an ε-net argument (Vershynin, 2012), which reduces the analysis to studying a quadratic form of Z; Ke and Wang (2024) analyzed this quadratic form by applying the martingale Bernstein inequality. Unfortunately, in the short-document case, it is hard to control the conditional variance process of the underlying martingale. In Section 4.4, we address this by leveraging the matrix Bernstein inequality (Tropp, 2012) and the decoupling inequality (de la Pena and Montgomery-Smith, 1995; De la Pena and Giné, 2012) for U-statistics.

Non-Stochastic Perturbation Analysis
In this subsection, we abuse notation and use G and G_0 to denote two arbitrary p × p symmetric matrices with rank(G_0) = K. For 1 ≤ k ≤ K, let λ̂_k and λ_k be the kth largest eigenvalues (in magnitude) of G and G_0, respectively, and let ξ̂_k ∈ R^p and ξ_k ∈ R^p be the associated eigenvectors. Write Λ̂ = diag(λ̂_1, λ̂_2, ..., λ̂_K), Ξ̂ = [ξ̂_1, ξ̂_2, ..., ξ̂_K], and define Λ and Ξ similarly. Let U ∈ R^{K×K} and V ∈ R^{K×K} be such that their columns contain the left and right singular vectors of Ξ̂′Ξ, respectively. Define sgn(Ξ̂′Ξ) = UV′. For any matrix B and q > 0, let

for some c_0 ∈ (0, 1). Consider an arbitrary p × p diagonal matrix Γ = diag(γ_1, γ_2, ..., γ_p), where:

then for the orthogonal matrix O = sgn(Ξ̂′Ξ), it holds simultaneously for 1 ≤ j ≤ p that:

Since γ_j is an upper bound for ∥e′_jΞ∥∥G − G_0∥ + ∥e′_j(G − G_0)Ξ∥, we can interpret the result in Lemma 1 as:

Comparing (17) with (15), the second term has been reduced: since Ξ projects the vector e′_j(G − G_0) onto a K-dimensional subspace, ∥e′_j(G − G_0)Ξ∥ can be much smaller than ∥e′_j(G − G_0)∥. In particular, this is true for the G and G_0 defined in (13). Hence, there is a significant improvement over the analysis in Ke and Wang (2024).

Large-Deviation Analysis of G − G_0
In this subsection, we focus on the specific G and G 0 as defined in (13).The crux of proving Theorem 1 lies in determining the upper bound γ j as defined in Lemma 1.This is accomplished through the following lemma.
Lemma 2 Under the setting of Theorem 1, let G and G_0 be as in (13). For any constant C_1 > 0, there exists C_3 > 0 such that, with probability 1 − n^{−C_1}, simultaneously for 1 ≤ j ≤ p:

The constant C_3 only depends on C_1 and (K, c_1, c_2, c_3, C_0).
We compare the bound for ∥G − G_0∥ in Lemma 2 with the one in Ke and Wang (2024), as summarized in (16). There is a significant improvement when N ≤ p². This improvement primarily stems from the application of a decoupling inequality for U-statistics, as elaborated below.
We outline the proof of the bound for ∥G − G_0∥. By (41)-(42) in Appendix A, G − G_0 decomposes into the sum of four matrices, and it is most subtle to bound the spectral norm of the fourth matrix, E_4, which is a sum of n independent matrices, say E_4 = Σ_{i=1}^n X_i. We apply the matrix Bernstein inequality of Tropp (2012) (Theorem 6) to obtain that, if there exist b > 0 and σ² > 0 such that ∥X_i∥ ≤ b almost surely for all i and ∥Σ_{i=1}^n E[X_i²]∥ ≤ σ², then for every t > 0,
$$\mathbb{P}\Bigl(\Bigl\|\sum_{i=1}^n X_i\Bigr\| \ge t\Bigr) \;\le\; 2p\,\exp\Bigl(-\frac{t^2/2}{\sigma^2 + bt/3}\Bigr).$$
Determining b and σ² requires upper bounds for ∥X_i∥ and ∥E[X_i²]∥. Since each X_i is equal to a rank-1 matrix minus its expectation, this reduces to deriving large-deviation bounds for ∥M_0^{-1/2}z_i∥². Note that each z_i can be equivalently represented as z_i = N_i^{-1}Σ_{m=1}^{N_i}(T_{im} − d_i^0), where T_{i1}, ..., T_{iN_i} are i.i.d. random vectors with T_{im} = e_j if the mth sampled word in document i is the jth vocabulary word (so that E[T_{im}] = d_i^0). Expanding ∥M_0^{-1/2}z_i∥² yields ∥M_0^{-1/2}z_i∥² = I_1 + I_2, where I_2 collects the diagonal terms and can be controlled using standard large-deviation inequalities, and
$$I_1 = N_i^{-2}\sum_{1\le m_1\ne m_2\le N_i}\bigl(T_{im_1}-d_i^0\bigr)'M_0^{-1}\bigl(T_{im_2}-d_i^0\bigr).$$
The remaining question is how to bound |I_1|. We notice that I_1 is a U-statistic of degree 2. The decoupling inequality (de la Pena and Montgomery-Smith, 1995; De la Pena and Giné, 2012) is a useful tool for studying U-statistics.
Theorem 5 (A special decoupling inequality (De la Pena and Giné, 2012)) Let {X_m}_{m=1}^N be a sequence of i.i.d. random vectors in R^d, and let {X̃_m}_{m=1}^N be an independent copy of {X_m}_{m=1}^N. Suppose that h : R^{2d} → R is a measurable function. Then, there exists a constant C_4 > 0, independent of n, m, d, such that for all t > 0:
$$\mathbb{P}\Bigl(\Bigl|\sum_{1\le m_1\ne m_2\le N} h(X_{m_1}, X_{m_2})\Bigr|\ge t\Bigr)\;\le\; C_4\,\mathbb{P}\Bigl(C_4\Bigl|\sum_{1\le m_1\ne m_2\le N} h(X_{m_1}, \tilde X_{m_2})\Bigr|\ge t\Bigr).$$

Let {T̃_{im}}_{m=1}^{N_i} be an independent copy of {T_{im}}_{m=1}^{N_i}. By Theorem 5, the large-deviation bound for I_1 can be inferred from the large-deviation bound for its decoupled version:

Using h(T_{im_1}, T̃_{im_2}) to denote the summand in the above sum, we have the decomposition:

The second term is a sum of independent variables and can be controlled by standard large-deviation inequalities. Hence, the analysis of I_1 reduces to the analysis of I*_1. We re-write I*_1 as:

Since ỹ is independent of y, we apply large-deviation inequalities twice. First, conditional on ỹ, I*_1 is a sum of N_i independent variables (the randomness comes from the T_{im}'s). We apply the Bernstein inequality to get a large-deviation bound for I*_1, which depends on a quantity σ²(ỹ). Next, since σ²(ỹ) can also be written as a sum of N_i independent variables (the randomness comes from the T̃_{im}'s), we apply the Bernstein inequality again to obtain a large-deviation bound for σ²(ỹ). Combining the two steps gives the large-deviation bound for I*_1, and hence for I_1.

Remark 4 The decoupling inequality is employed multiple times to study other U-statistics-type quantities arising in our proof. For example, recall that (G − G_0) decomposes into the sum of four matrices, and we have only discussed how to bound ∥E_4∥. In the analysis of ∥E_2∥ and ∥E_3∥, we need to bound other quadratic terms involving a sum over (i, m), with 1 ≤ i ≤ n and 1 ≤ m ≤ N_i. In that case, we need a more general decoupling inequality. We refer readers to Theorem 8 in Appendix A for more details.
Remark 5 The analysis in Ke and Wang (2024) uses an ϵ-net argument (Vershynin, 2012) and the martingale Bernstein inequality (Freedman, 1975) to study ∥E_4∥. In our analysis, we use the matrix Bernstein inequality (Tropp, 2012) instead of the ϵ-net argument. The matrix Bernstein inequality enables us to tackle the quadratic term related to each i separately, instead of handling complicated quadratic terms involving summation over i and m simultaneously. Additionally, we adopt the decoupling inequality for U-statistics (de la Pena and Montgomery-Smith, 1995; De la Pena and Giné, 2012), instead of the martingale Bernstein inequality, to study all the quadratic terms arising in our analysis. The decoupling inequality converts the tail analysis of quadratic terms into the tail analysis of (conditionally) independent sums. It provides sharper bounds when the variables have heavy tails (which is the case for the word counts in a topic model, especially when documents are short).

Proof sketch of Theorem 1
We combine the non-stochastic perturbation result in Lemma 1 and the large-deviation bounds in Lemma 2 to prove Theorem 1. By Lemma 4, |λ_K| ≥ C^{-1}nβ_n. It follows from Weyl's inequality, the first claim in Lemma 2, and the assumption p log²(n) ≤ Nnβ_n² that, with probability 1 − n^{−C_1}:

In addition, it can be shown (see Lemma 4) that ∥e′_jΞ∥ ≤ Ch_j^{1/2}. Combining this with the two claims in Lemma 2 gives that, with probability 1 − n^{−C_1}:

We hope to apply Lemma 1. This requires obtaining a bound for

Similarly to the analysis of ∥e′_j(G − G_0)Ξ∥, we can show (see the proofs of Lemmas 7 and 8, such as (75)) that:

where the last inequality is because p log²(n) ≤ Nn. We immediately have:

We then apply Lemma 1 to obtain a bound on ∥e′_j(Ξ̂ − ΞO′)∥. The claim of Theorem 1 follows immediately by plugging in the value of γ_j given above.

Summary and Discussion
The topic model imposes a "low-rank plus noise" structure on the data matrix. However, the noise is not simply additive; rather, it consists of centered multinomial random vectors. The eigenvector analysis in a topic model is more complex than standard eigenvector analysis for random matrices. Firstly, the entries of the data matrix are weakly dependent, making techniques such as leave-one-out inapplicable. Secondly, due to the significant word frequency heterogeneity in natural languages, entry-wise eigenvector analysis becomes much more nuanced, as different entries of the same eigenvector have significantly different bounds. Additionally, the data exhibit Bernstein-type tails, precluding the use of random matrix theory tools that assume sub-exponential entries. While we build on the analysis in Ke and Wang (2024), we address these challenges with new proof ideas. Our results provide the most precise eigenvector analysis and rate-optimality results for topic modeling, to the best of our knowledge.
A related but more ambitious goal is to obtain higher-order expansions of the empirical singular vectors. Since the random matrix under study in the topic model is of the Wishart type, we can possibly borrow techniques from Bloemendal et al. (2016) to study the joint distribution of the empirical singular values and singular vectors. In this paper, we assume the number of topics, K, is finite, but our analysis can be easily extended to the scenario of a growing K (e.g., K = O(log(n))). We assume min{p, N} ≥ log³(n). When p < log³(n), the problem becomes a low-dimensional eigenvector analysis problem, which is easy to tackle. When N < log³(n), we are in the extremely-short-document case (i.e., each document has only a finite length, say fewer than 20 words, as in documents such as Tweets). We leave this to future work.

Appendix A. Preliminary Lemmas and Theorems
In this section, we collect the preliminary lemmas and theorems that will be used in the entry-wise eigenvector analysis. Under Assumption 3, N_i ≍ N for all i. Therefore, throughout this section and the subsequent sections, we assume N_i = N without loss of generality.
The first lemma describes the estimates of the entries of M_0, reveals their relation to the underlying frequency parameters, and further provides the large-deviation bound for the normalization matrix M̃.
Lemma 3 Let h_j = Σ_{k=1}^K A_k(j) for 1 ≤ j ≤ p. Suppose the conditions in Theorem 1 hold. Then:

and, for some constant C > 0, with probability 1 − o(n^{−3}), simultaneously for all 1 ≤ j ≤ p,

Furthermore, with probability 1 − o(n^{−3}),

Remark 6 In this lemma and the subsequent lemmas, "with probability 1 − o(n^{−3})" can always be replaced by "with probability 1 − n^{−C_1}" for an arbitrary constant C_1 > 0. The small-probability events in these lemmas come from the Bernstein inequality or the matrix Bernstein inequality. These inequalities concern small-probability events associated with an arbitrary probability δ ∈ (0, 1), and the high-probability bounds depend on log(1/δ). When δ = n^{−C_1}, log(1/δ) = C_1 log(n). Therefore, changing C_1 only changes the high-probability bound by a constant. Without loss of generality, we take C_1 = 4 for convenience.
The proof of the first statement is quite similar to the proof detailed in the supplementary materials of Ke and Wang (2024). The only difference is the presence of the additional factor N/N_i. Thanks to the condition that the N_i's are of the same order, it is not hard to see that M_0(j, j) ≍ n^{-1}Σ_{i=1}^n d_i^0(j), where the RHS is exactly the definition of M_0 in Ke and Wang (2024). Thus, the proof follows simply under Assumption 2. To obtain the large-deviation bound, the following representation is crucial:

where

The RHS is a sum of independent random variables, thus allowing the application of the Bernstein inequality. The inequality (18) is not provided in the supplementary materials of Ke and Wang (2024), but it follows easily from the first statement. We prove (18) in detail below.
By definition, it suffices to show that:

simultaneously for all 1 ≤ j ≤ p. To this end, we derive:

Using the large-deviation bound |M̃(j, j) − M_0(j, j)| ≤ C√(h_j log(n)/(Nn)) = o(h_j) and the estimate M_0(j, j) ≍ h_j, we bound the denominator by:

with probability 1 − o(n^{−3}), simultaneously for all 1 ≤ j ≤ p. Consequently:

where the last step is due to h_j ≥ h_min ≥ C/p. This completes the proof of (18). The next lemma presents the eigen-properties of the population data matrix.
Lemma 4 (Lemmas F.2, F.3, and D.3 in Ke and Wang (2024)) Suppose the conditions in Theorem 1 hold. Let G_0 be as in (13). Denote by λ_1 ≥ λ_2 ≥ ... ≥ λ_K the non-zero eigenvalues of G_0. There exists a constant C > 1 such that:

Furthermore, let ξ_1, ξ_2, ..., ξ_K be the associated eigenvectors of G_0. Then:

The above lemma can be proved in the same manner as the corresponding results in the supplementary materials of Ke and Wang (2024). Given our more general condition on Σ_A, which allows its smallest eigenvalue to converge to 0 as n → ∞, the results on the eigenvalues are slightly different. In our setting, only the largest eigenvalue is of order n, and it is well-separated from the others because the largest eigenvalue of n^{-1}G_0 has multiplicity one, which can be shown using Perron's theorem and the last inequality in Assumption 2. For the other eigenvalues, their order is governed by β_n in Assumption 2. The details are very similar to those in the supplementary materials of Ke and Wang (2024), after adapting to our relaxed condition on Σ_A, so we avoid redundant derivations here.
Throughout the analysis, we need the matrix Bernstein inequality and the decoupling inequality for U-statistics. For the readers' convenience, we state these theorems below.
Theorem 6 Let X_1, ..., X_N be independent, mean-zero, n × n symmetric random matrices such that ∥X_i∥ ≤ b almost surely for all i and ∥Σ_{i=1}^N E[X_i²]∥ ≤ σ². Then, for every t ≥ 0, we have:
$$\mathbb{P}\Bigl(\Bigl\|\sum_{i=1}^N X_i\Bigr\| \ge t\Bigr) \;\le\; 2n\,\exp\Bigl(-\frac{t^2/2}{\sigma^2 + bt/3}\Bigr).$$
The following two theorems are special cases of Theorem 3.4.1 in De la Pena and Giné (2012), which implies that using the decoupling inequality reduces the analysis of U-statistics to the study of sums of (conditionally) independent random variables.
Theorem 7 Let {X_i}_{i=1}^n be a sequence of i.i.d. random vectors in R^d, and let {X̃_i}_{i=1}^n be an independent copy of {X_i}_{i=1}^n. Then, there exists a constant C > 0, independent of n and d, such that:

Theorem 8 Let {X_m^{(i)}}, for 1 ≤ i ≤ n and 1 ≤ m ≤ N, be a sequence of i.i.d. random vectors in R^d, and let {X̃_m^{(i)}}_{i,m} be an independent copy of {X_m^{(i)}}_{i,m}. Suppose that h : R^{2d} → R is a measurable function. Then, there exists a constant C > 0, independent of n, m, d, such that:

The key difference between the above theorems lies in the index set used in the sum. In Theorem 7, the random variables are indexed by i and all pairs (X_i, X_j) are included; in contrast, Theorem 8 uses both i and m and considers only the pairs that share the same index i. However, both are special cases of Theorem 3.4.1 (with degree 2) in De la Pena and Giné (2012), which covers a broader family of functions {h_ij(·, ·)}_{i,j}, where each h_ij(·, ·) can differ for varying i, j. By assigning all h_ij(·, ·) the same product function, we obtain Theorem 7; whereas Theorem 8 follows from specifying:

Appendix B. Proofs of Lemmas 1 and 2

B.1 Proof of Lemma 1
Using the definition of eigenvectors and eigenvalues, we have GΞ̂ = Ξ̂Λ̂ and G_0Ξ = ΞΛ. Additionally, since G_0 has rank K, G_0 = ΞΛΞ′. It follows that:

As a result:

We plug this equality into the first term on the RHS of (19) to obtain:

for any orthogonal matrix O. Combining this with (19) gives:

Fix O = sgn(Ξ̂′Ξ). The sine-theta theorem (Davis and Kahan, 1970) yields:

We use (21) to bound the first two terms on the RHS of (20): the RHS in the second line above dominates the RHS in the first line. We plug these upper bounds into (20) to get:

We notice that the second term on the RHS of (22) still involves Ξ̂, and we further bound this term. By the assumption of this theorem, there exists a diagonal matrix Γ such that

Additionally, for any vector v ∈ R^p and matrix B ∈ R^{p×K}, it holds that ∥v′B∥ ≤ Σ_j |v_j|∥e′_jB∥ ≤ Σ_j |v_j|∥B∥_{2→∞} ≤ ∥v∥_1∥B∥_{2→∞}. We then bound the second term on the RHS of (22) as follows:

Plugging (23) into (22) gives:

where in the last line we have used the assumption that γ_j is an upper bound for ∥e′_jΞ∥∥G − G_0∥ + ∥e′_j(G − G_0)Ξ∥. We multiply both the LHS and RHS of (24) by γ_j^{-1} and take the maximum over j. This gives:

We further plug this inequality into (24) to obtain:

This proves the claim.

B.2 Proof of Lemma 2
The first claim is the same as the one in Lemma 5 and will be proved there.
The second claim follows by simply collecting arguments in the proof of Lemma 5, as shown below: By ( 41 We apply Lemma 7 to get large-deviation bounds for ∥e ′ j E s Ξ∥ with s ∈ {2, 3, 4}.This lemma concerns ∥e ′ j E s Ξ∥, but in its proof we have already analyzed ∥e ′ j E s Ξ∥.In particular, ∥e ′ j E 2 Ξ∥ and ∥e ′ j E 3 Ξ∥ have the same bounds as in ( 46), and the bound for ∥e ′ j E 4 Ξ∥ only has the first term in (47).In summary: It remains to bound ∥e ′ j E 1 Ξ∥.We first mimic the steps of proving (50) of Lemma 7 (more specifically, the derivation of ( 80), except that Ξ is replaced by Ξ) to obtain: We note that: For s ∈ {2, 3}, we have ∥e ′ j E s ∥ ≤ C h j p log(n)/(N n).This has been derived in the proof of Lemma 7: when controlling ∥e ′ j E 2 Ξ∥ and ∥e ′ j E 3 Ξ∥ there, we first bound them by ∥e ′ j E 2 ∥ and ∥e ′ j E 3 ∥, respectively, and then study ∥e ′ j E 2 ∥ and ∥e ′ j E 3 ∥ directly).We plug these results into (29) to obtain: For ∥e ′ j E 4 (M 1/2 0 M −1/2 − I p )Ξ∥, we cannot use the same idea to bound it as for s ∈ {2, 3}, because the bound for ∥e ′ j E 4 ∥ is much larger than those for ∥e ′ j E 2 ∥ and ∥e ′ j E 4 ∥.Instead, we study ∥e ′ j E 4 (M 1/2 0 M −1/2 − I p )Ξ∥ directly.This part is contained in the proof of Lemma 8; specifically, in the proof of ( 48).There we have shown: We plug ( 31) into ( 30) and note that λ 1 = O(n) and ∥e ′ j Ξ∥ = O(h 1/2 j ) (by Lemma 4).We also use the assumption that N n ≥ N nβ 2 n ≥ p log 2 (n) and the bound for We plug ( 28) and ( 32) into ( 27).This proves the second claim.

Appendix C. The Complete Proof of Theorem 1
A proof sketch of Theorem 1 has been given in Section 4.4.For the ease of writing formal proofs, we have re-arranged the claims and analyses in Lemmas 1 and 2, so the proof structure here is slightly different from the sketch in Section 4.4.For example, Lemma 5 combines the claims of Lemma 2 with some steps in proving Lemma 1; the remaining steps in the proof of Lemma 1 are combined into the proof of the main theorem.First, we present a key technical lemma.The proof of this lemma is quite involved and relegated to Appendix D.1.
Next, we similarly bound the second term:

Here we used the fact that λ̂_K ≥ Cnβ_n, which follows from (36) and Lemma 4.
To proceed, we multiply both sides of (40) by h_j^{-1/2} and take the maximum over j. It follows that:

Note that √(p log(n))/√(Nnβ_n²) = o(1) by Assumption 3. We further rearrange both sides above to get:

Plugging the above estimate into (40), we conclude the proof of Theorem 1.

Appendix D. Entry-Wise Eigenvector Analysis and Proof of Lemma 5
To finalize the proof of Theorem 1 as outlined in Appendix C, the remaining task is to prove Lemma 5. Recall the definition in (13):

Write D = D_0 + Z, where Z = (z_1, z_2, ..., z_n) is a mean-zero random matrix with each N_i z_i being a centered Multinomial(N_i, Aw_i) vector. By this representation, we decompose the perturbation matrix G − G_0 as follows:

where:

Here the second step of (41) is due to the identity:

which can be obtained by:

Throughout the analysis in this section, we will frequently rewrite and use the representation d_i = N_i^{-1}Σ_{m=1}^{N_i} T_{im} (equivalently, z_i = N_i^{-1}Σ_{m=1}^{N_i}(T_{im} − d_i^0)), as it introduces a sum of independent random variables; here T_{i1}, ..., T_{iN_i} are the i.i.d. word-indicator vectors of document i, taking values in {e_1, ..., e_p}. We use the notation d_i^0 := E[d_i] = E[T_{im}] = Aw_i for simplicity. By (41), in order to prove Lemma 5, it suffices to study ∥E_s∥ and ∥e′_jE_sΞ∥/n, for s = 1, 2, 3, 4 and 1 ≤ j ≤ p.
The estimates for the aforementioned quantities are provided in the following technical lemmas, whose proofs are deferred to later sections.
Lemma 6 Suppose the conditions in Theorem 1 hold. There exists a constant C > 0 such that, with probability 1 − o(n^{−3}):

Lemma 7 Suppose the conditions in Theorem 1 hold. There exists a constant C > 0 such that, with probability 1 − o(n^{−3}), simultaneously for all 1 ≤ j ≤ p:

with O = sgn(Ξ̂′Ξ).
Lemma 8 Suppose the conditions in Theorem 1 hold. There exists a constant C > 0 such that, with probability 1 − o(n^{−3}), simultaneously for all 1 ≤ j ≤ p:

and furthermore:

For proving Lemmas 6 and 7, the difficulty lies in showing (45) and (47), as the quantity E_4 involves quadratic terms of Z together with its dependence on Ξ̂. We overcome this hurdle by decomposing Ξ̂ = Ξ + (Ξ̂ − ΞO′) and employing decoupling techniques (Theorems 7 and 8).
Considering the expression of E_1, in which DD′ is involved, the proof of (50) in Lemma 8 relies significantly on the estimates in Lemma 7, together with (48) and (49). The detailed proofs are presented systematically in the subsequent sections, following the proof of Lemma 5.

D.1 Proof of Lemma 5
We employ the technical lemmas (Lemmas 6-8) to prove Lemma 5. We start with (33). By the representation (41), it is straightforward to obtain that:

for some constant C > 0, with probability 1 − o(n^{−3}). Under Assumption 3, it follows that:

Therefore, we complete the proof of (33).

D.2 Proof of Lemma 6
We examine each ∥E i ∥ for i = 1, 2, 3, 4. We start with the easy one, ∥E 2 ∥.Recall D 0 = AW .We denote by W ′ k the k-th row of W and rewrite , we have: We analyze each factor in the summand: where we used the fact that A k (j) ≤ h j for 1 ≤ j ≤ p.Hence, what remains is to prove a high-probability bound for each Z ′ j W k .By the representation (43): We then apply Bernstein inequality to the RHS above.By straightforward computations: and the individual bound for each summand is C/N .Then, one can conclude from Bernstein inequality that with probability 1 − o(n −3−c 0 ): As a result, considering all 1 ≤ j ≤ p, under pn −c 0 ≤ C from Assumption 3, we have: with probability 1 − o(n −3 ).Here, in the first step, we used M 0 (j, j) ≍ h j ; the last step is due to the conditions h j ≥ h min ≥ C/p and p log(n) ≪ N n.Plugging ( 54) and ( 52) into (51) gives: Furthermore, by definition, E 3 = E ′ 2 and ∥E 3 ∥ = ∥E 2 ∥.Therefore, we directly conclude the upper bound for ∥E 3 ∥.
Next, we study E 4 and prove (45).Notice that M 0 (j, j) ≍ h j for all 1 ≤ j ≤ p.It suffices to prove: We prove (56) by employing Matrix Bernstein inequality (i.e., Theorem 6) and decoupling techniques (i.e., Theorem 7).First, write: In order to get sharp bound, we employ the truncation idea by introducing: for some sufficiently large C > 0 that depends on C 0 (see Assumption 3) and 1 E i represents the indicator function.We then have: under the event n i=1 E i .We will prove the large-deviation bound of in the following steps.
(a) We claim that: (b) We claim that under the event n i=1 E i : (c) We aim to derive a high probability bound of n n i=1 X i by Matrix Bernstein inequality (i.e., Theorem 6).We show that with probability 1 − o(n −3 ), for some large C > 0: If (a)-(c) are claimed, with the condition that N < Cn −C 0 from Assumption 3, it is straightforward to conclude that: with probability 1 − o(n −3 ).This gives (45), except that we still need to verify (a)-(c).
In the sequel, we prove (a), (b) and (c) separately.To prove (a), it suffices to show that P( ) ) for all 1 ≤ i ≤ n.By definition, for any fixed i, N i z i is centered multinomial with N i trials.Therefore, we can represent: Then it can be computed that: We write: where: First, we study I 1 .Let { T im } N m=1 be an independent copy of {T im } N m=1 and: We apply Theorem 7 to I 1 and get: It suffices to obtain the large-deviation of I 1 instead.Rewrite: We derive the high-probability bound for T 1 first.For simplicity, write: We apply Bernstein inequality condition on {T im } N i m=1 .By elementary computations: where we used that fact d 0 i (j) = e ′ j Aw i ≤ e ′ j A1 K = h j .Furthermore, with the individual bound N −1 max t {a(t)/ √ h t }, we obtain from Bernstein inequality that with probability 1 − o(n −(2C 0 +4) ): by choosing appropriately large C > 0. We then consider using Bernstein inequality to study a(t) and get: simultaneously for all 1 ≤ t ≤ p, with probability 1 − o(n −(2C 0 +4) ).As a result, under the condition min{p, N } ≥ C 0 log(n) from Assumption 3, it holds that: We then proceed to the second term in (62), Using Bernstein inequality, similarly to the above derivations, we get: The individual bound is given by N −2 /h min .If follows from Bernstein inequality that: with probability 1 − o(n −(2C 0 +4) ).Consequently, by pluging ( 63) and ( 64) into (62) and using Assumption 3, with probability 1 − o(n −(2C 0 +4) ).By (61), we get: with probability 1 − o(n −(2C 0 +4) ).Second, we prove a similar bound for I 2 with: We compute the variance by: This, together with the crude bound: gives that with probability 1 − o(n −(2C 0 +4) ), for some sufficiently large C > 0: under Assumption 3. Combing ( 66) and ( 67), yields that: with probability 1 − o(n −(2C 0 +4) ).Thus, we conclude the claim P(E c i ) = o(n −(2C 0 +4) ) for all 1 ≤ i ≤ n.The proof of (a) is complete.
Next, we show the proof of (b).Recall the second term on the RHS of (57).Using the convexity of ∥ • ∥ and the trivial bound: we get: Here, in the last step, we used the fact that p ≤ n C 0 , which follows from the second condition in Assumption 3.This yields the estimate in (b).Finally, we claim (c) by Matrix Bernstein inequality (i.e., Theorem 6).Towards that, we need to derive the upper bounds of ∥ X i ∥ and ∥E X 2 i ∥.By definition of X i , that is: we easily derive that: for some large C > 0, in which we used the estimate: By the above inequality, it also holds that: Moreover: Since E X i = 0, it follows from Theorem 6 that: with probability 1 − o(n −3 ), for some large C > 0. We hence finish the proof of (c).The proof of ( 45) is complete now.Lastly, we prove ∥E 1 ∥ ≤ C pn log(n)/ √ N .By definition, we rewrite: Decomposing D by D 0 + Z gives rise to: Applying Lemma 4, together with ( 55) and ( 56), we see that: Furthermore, it follows from Lemma 3 that: Combining the estimates above, we conclude that: We therefore finish the proof of Lemma 6.

D.3 Proof of Lemma 7
We begin with the proof of ( 46).Recall the definitions: We bound: by the second inequality in (52).Similarly to how we derived (54), using Bernstein inequality, we have: in view of p log(n) 2 ≤ N n and h j ≥ h min ≥ c/p from Assumption 3. Analogously, for Ξ 3 , we have: where we used Hence, we complete the proof of (46).
In the sequel, we focus on the proof of (47).Recall that We expect to show that: Let us decompose ∥e ′ j E 4 Ξ∥/n as follows: We bound n −1 ∥e ′ j E 4 Ξ∥ first.For any fixed 1 ≤ k ≤ K, in light of the fact that M 0 (j, j) ≍ h j for all 1 ≤ j ≤ p: with: For J 1 , it is easy to compute the order of its variance as follows: where we used the facts that ξ k (t) ≤ √ h t , d 0 i (j) ≤ Ch j , and t d 0 i (t) = 1.Furthermore, with the trivial bound of each summand in J 1 given by CN −2 h −1/2 j , it follows from the Bernstein inequality that: with probability 1−o(n −3−C 0 ).Here, we used the conditions that h j ≥ C/p and p log(n) 2 ≤ N n.
We proceed to estimate |J 2 |.Employing Theorem 8 with: to examine the high probability bound of: where { T im 1 } is an independent copy of {T im 1 }.Imitating the proof of (62), we rewrite: Notice that b im can be crudely bounded by C in view of ξ k (t) ≤ √ h t .Then, condition on { T im 1 }, by Bernstein inequality, we can derive that: . Consequently, we arrive at: under the assumption that h j ≥ C/p.As K is a fixed constant, we further conclude: with probability 1 − o(n −3−C 0 ).Next, we estimate n −1 ∥e ′ j E 4 ( Ξ − ΞO ′ )∥.By definition, we write: For each 1 ≤ t ≤ p: For (I) k , using Bernstein inequality, it yields that with probability 1 − o(n −3−2C 0 ): where the last step is due the the fact p log(n) 2 ≤ N n from Assumption 3. As a result: Here, we used the Cauchy-Schwarz inequality to get: For (II) t , since it is a U-statistics, we then apply the decoupling idea, i.e., Theorem 8, such that its high probability bound can be controlled by that of ( II) t , defined by: where { T i m} i, m is the i.i.d.copy of {T im } i,m .We further express: Condition on { T i m} i, m, we use Bernstein inequality and get: It follows that: where the last step is due to the trivial bound that: for any 1 ≤ m ≤ N .Thus, combining ( 73) and ( 74), under the condition h j ≥ C/p, we obtain: with probability 1 − o(n −3−C 0 ).Moreover, employing the estimate M 0 (j, j) ≍ h j for all 1 ≤ j ≤ p, it follows that: with probability 1 − o(n −3−C 0 ).
In the end, we combine (72) and (76) and consider all j simultaneously to conclude that:

with probability 1 − o(n^{−3−C_0}). Combining all 1 ≤ j ≤ p, together with p ≤ n^{C_0}, we complete the proof.

E.3 Proof of Theorem 4
The optimization in (12) has an explicit solution given by:

Notice that (A′M_0^{-1}A)^{-1}A′M_0^{-1}d_i^0 = (A′M_0^{-1}A)^{-1}A′M_0^{-1}Aw_i = w_i. Consequently:

What remains is to analyze:

For T_1, we bound:

Using the estimate M_0(j, j) ≍ h_j and the large-deviation bound on |M̃(j, j)^{-1} − M_0(j, j)^{-1}| implied by Lemma 3, it follows that:

and similarly:

As a result:

Next, for T_2, we bound:

where, for (Â − A)′M̃^{-1}d_i, given the low dimension K, we crudely bound: