SONIC: SOcial Network analysis with Influencers and Communities

The integration of social media characteristics into an econometric framework requires modeling a high-dimensional dynamic network whose parameter dimension is typically much larger than the number of observations. To cope with this problem, we introduce SONIC, a new high-dimensional network model that assumes that (1) only a few influencers drive the network dynamics, and (2) the community structure of the network is characterized by homogeneity of response to specific influencers, implying the underlying similarity of the nodes. We propose an estimation procedure based on a greedy algorithm and LASSO regularization. Through theoretical study and simulations, we show that the matrix parameter can be estimated even when the sample size is smaller than the size of the network, and that the sparsity regularization identifies the important nodes in the network. Using a novel dataset retrieved from one of the leading social media platforms, StockTwits, and quantifying users' opinions via natural language processing, we model the opinion network dynamics among a select group of users and further detect the latent communities. Deploying a SONIC analysis with the lexicon-based sentiments, we display opinion networks for Apple and Bitcoin users (nodes), detect K = 2 communities using stability analysis, and subsequently identify the influencers. We also discuss the choice of the LASSO regularization parameter λ and the choice of the number of clusters.


Introduction
A network is defined through a set of nodes and edges with a given adjacency structure. In a social, financial, or econometric context, such networks are often dynamic, and nodes, such as individuals or firms, change their activities over time. An analysis of such network dynamics is often based on vector autoregression. Consider a network that produces a time series Y_t ∈ R^N, t = 1, . . . , T, with the dependencies between its elements modeled through the equation Y_t = ΘY_{t−1} + W_t, (1.1) where W_t are innovations that satisfy E[W_t | F_{t−1}] = 0, F_t = σ{Y_t, Y_{t−1}, . . . }, so that the interactions between the nodes are described by an autoregression operator Θ ∈ R^{N×N}. In terms of the network connections, we say that a node i is connected to a node j if Θ_ij ≠ 0, so that the nonzero coefficients represent the adjacency matrix of the network, and the sparsity of Θ represents the number of edges. For large-scale time series, one encounters the curse of dimensionality, as estimating the matrix parameter Θ with N² elements requires a significantly large number of observations T. Several attempts to reduce the dimensionality have been made in the literature. Assuming that the elements of a time series form a connected network, Zhu et al. (2017) introduce a Network Autoregression (NAR) with Θ_ij = βA_ij / Σ_{k=1}^N A_ik, provided that the adjacency matrix A ∈ R^{N×N} is known. Here, the regression operator, defined up to a single parameter β, called the network effect, can be estimated through simple least squares. Zhu et al. (2019) also extend this model to conditional quantiles. Furthermore, Zhu and Pan (2020) argue that a single network parameter may not be satisfactory, as it treats all nodes of the network homogeneously. In particular, the NAR implies that each node is affected by its neighbors to the same extent, while in reality we may have, e.g., financial institutions that are affected less or more than others (see Mihoci et al. (2020)).
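As a minimal sketch, the dynamics in (1.1) can be simulated as follows; the operator Θ below is a toy example (all values are illustrative assumptions, not estimates from data), with a single column active so that one node drives the whole network:

```python
import numpy as np

# Toy simulation of Y_t = Theta Y_{t-1} + W_t (equation (1.1)).
# Theta here is an illustrative assumption: only column 0 is nonzero,
# so node 0 acts as the single driver of the dynamics.
rng = np.random.default_rng(0)
N, T = 10, 500
Theta = np.zeros((N, N))
Theta[:, 0] = 0.5                      # every node loads on node 0
assert np.max(np.abs(np.linalg.eigvals(Theta))) < 1  # stationarity check

Y = np.zeros((T, N))
for t in range(1, T):
    Y[t] = Theta @ Y[t - 1] + rng.normal(scale=0.1, size=N)

# The nonzero entries of Theta play the role of the adjacency matrix:
edges = np.count_nonzero(Theta)
print(edges)  # 10
```

The number of nonzero entries of Θ corresponds to the number of edges of the network, as described above.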
Hence, they propose to detect communities in a network based on the given adjacency matrix and suggest that the nodes in each community share a separate network effect parameter. Gudmundsson (2018) takes a somewhat opposite direction: the BlockBuster algorithm determines the communities through the estimated autoregressive model, which, however, does not solve the dimensionality problem. Apart from this line of work, sparse regularizations have been used extensively; see Fan et al. (2009); Han et al. (2015); Melnyk and Banerjee (2016).
To sum up, we point out the following problems that one may encounter when dealing with vector autoregression in this social media context:
• The VAR parameter dimension is large; one requires even larger time intervals for consistent estimation. Even if one can afford such a dataset, in the long run autoregressive models may have time-varying parameters, see e.g., Čížek et al. (2009). We therefore impose assumptions on the structure of the operator Θ, so that estimation with moderate sample sizes is possible.
• The NAR model assumes that the adjacency matrix is known. In particular, this is justified for social networks with a stable and natural friendship or follower-followee relationship. For a realistic network of financial institutions, there is no explicitly defined adjacency matrix, and one has to evaluate it heuristically using additional information (identical shareholders, trading volumes) or by analyzing correlations and lagged cross-correlations between returns or risk profiles, see Diebold and Yılmaz (2014) and Chen et al. (2019b). However, there is no rigorous reason to believe that the operator in (1.1) depends explicitly on such an adjacency matrix, see also Cha et al. (2010).
Our main contribution is a new method for modeling social network dynamics, a challenging task in the presence of the curse of dimensionality and in the absence of knowledge of the adjacency matrix. The proposed SONIC (SOcial Network analysis with Influencers and Communities) has the following advantages. First, it allows us to identify the hidden figures who mainly drive the opinion-generating process on social media. Second, it discovers the hidden community structure. The proposed estimation algorithm uncovers the hidden figures and communities simultaneously until the minimal empirical risk is attained. Third, we discuss the theoretical properties and underpinnings that ensure estimation efficiency. Apart from dimensionality, social media data feature missing observations, bringing another challenge for researchers. The proposed SONIC is therefore equipped with a correction mechanism for missing observations. We demonstrate the applicability of SONIC on a novel social media dataset.
In more detail, the heuristics behind the assumptions of SONIC are motivated by social media users' activities and characteristics. Based on the well-known user experience on platforms like Facebook or Twitter, one can assume that some users have significantly more followers than others. Take, for example, celebrities, athletes, analysts, politicians, or Instagram divas. In a network view, these users are nodes that have much more influence than the rest of the nodes: such nodes are thereby defined as influencers. In the framework of autoregression, a node j is an influencer if there is a substantial number of other nodes i such that Θ_ij ≠ 0. Assuming that the number of influencers is limited, we let only a few columns of the matrix Θ be nonzero. This allows us to concentrate on the connections to the influencers, significantly reducing the number of parameters to be estimated. A similar idea is used in Chen et al. (2018), with a group-LASSO regularization imposed, yielding a solution with few active columns. Notice, however, that relying on sparsity alone still requires T > N, see e.g., Fan et al. (2009); Chernozhukov et al. (2020).
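The column-sparsity assumption can be illustrated with a toy example (the influencer indices and weights below are assumptions made for the sketch): an influencer corresponds to a column of Θ with nonzero support, and recovering the active columns identifies the influencers.

```python
import numpy as np

# Sketch: an "influencer" is a column j of Theta with many nonzero
# entries Theta[i, j]. The ground-truth indices below are hypothetical.
rng = np.random.default_rng(1)
N = 50
influencers = [2, 17, 40]                    # assumed ground truth
Theta = np.zeros((N, N))
for j in influencers:
    Theta[:, j] = rng.uniform(0.1, 0.3, size=N) / N   # small positive loads

# Recover the influencers as the columns with nonzero support:
support = np.count_nonzero(Theta, axis=0)
detected = np.flatnonzero(support > 0).tolist()
print(detected)  # [2, 17, 40]
```

With only a few active columns, the number of free parameters drops from N² to roughly sN, which is what makes estimation with moderate T feasible.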
It is also well known that social networks have small communities, with the nodes exhibiting higher connection density or similar behavior inside communities. Zhu and Pan (2020) analyze a more realistic set-up by allowing separate parameters for each community instead of a single network effect parameter. In our notation, the conditional mean of the response of node i satisfies E[Y_it | F_{t−1}] = Σ_{j=1}^N Θ_ij Y_{j,t−1}. Therefore, the behavior of node i is characterized by the coefficients Θ_i1, . . . , Θ_iN, i.e., by the nodes it depends upon. We assume that the nodes are separated into a few clusters such that nodes from the same cluster share the same dependency structure, which brings a bigger picture into view: instead of saying that two nodes from the same cluster are more likely to be connected, we say that they connect to the same influencers.
Our primary focus is an application to opinion dynamics extracted from a microblogging platform dedicated to stock trading, StockTwits (available at https://stocktwits.com). For each user, one can quantify, via textual analysis, the average sentiment score over the messages they post during the day. Analyzing these high-dimensional time series, we can, on the one hand, identify influencers, the users whose opinions are overwhelmingly important, and on the other hand, determine the community structure. One challenge emerges here: the presence of missing observations, since sometimes users do not leave any message. We treat this as follows: assume there is an underlying opinion process that follows the network dynamics (1.1). However, such an opinion process might be only partially observed, given the random arrival of messages from each user, which leads to a commonly used model for missing observations involving masked Bernoulli random variables. The proposed SONIC accommodates this situation. We return to it in detail in Section 3.3.
The rest of the paper is organized as follows. Section 2 introduces readers to the StockTwits platform, describes in detail the available dataset and the process of users' sentiment scores extraction. In Section 3, we first introduce our SONIC model, then describe the estimation procedure and provide a consistency result. In Section 4, we provide simulation results that support the theoretical properties of our estimator. Next, in Section 5, we present and discuss the results of the application of our model to the dataset retrieved from StockTwits. Section 6 concludes. We dedicate Section 7 to the proofs, as well as Sections A, B in the appendix. Readers can find all numerical examples and the codes developed for the SONIC model on www.quantlet.de.

StockTwits
Social media are an ideal platform where users can easily communicate with each other, exchange information, and share opinions. The increasing popularity of social media is evidence of the growing demand for exchanging opinions and information among individual users. Among social media platforms, we are particularly interested in StockTwits for several reasons. Firstly, it is a social media platform designed for sharing ideas between investors, traders, and entrepreneurs. It is similar to Twitter but dedicated to the discussion of financial issues. One of the innovations that led to its popularity is a well-designed reference between the message content and the mentioned stock symbols. Conversations are organized around 'cashtags' (e.g., '$AAPL' for APPLE; '$BTC.X' for BITCOIN) that allow streams to be narrowed down to specific assets. Secondly, users can express their sentiments/opinions by labeling their messages as 'Bearish' (negative) or 'Bullish' (positive) via a toggle button. These are so-called self-reported sentiments, and the labeled data permit the use of supervised textual analysis, which requires a training dataset.
We use the StockTwits Application Programming Interface (API) to retrieve all messages containing the preferred cashtags. The StockTwits API also provides, for each message, its unique user identifier, the time it was posted with one-second precision, and the sentiment declared by the user ('Bullish,' 'Bearish,' or unclassified). Among over a thousand tickers/symbols, we pick two, $AAPL for APPLE and $BTC.X for BITCOIN, which represent the most popular security and cryptocurrency, respectively. Since the two symbols may attract investors/users with different degrees of interaction, we may uncover disparate network dynamics. In Table 1, we summarize the messages' statistics and document the generated sentiment series. Firstly, the BTC investors tend to disclose their sentiment, as evidenced by 44% of messages being labeled, whereas for AAPL only 28% of messages are labeled. Secondly, an imbalance between the numbers of positive and negative messages shows that online investors are in general optimistic, as also found by Kim and Kim (2014) and Avery et al. (2016). As to the average message volume per day, we observe that AAPL certainly attracts more attention than BTC.

Quantifying message content
Two main methods are used for textual sentiment analysis: dictionary-based approaches and machine learning techniques. We opt for the dictionary-based approach in consideration of transparency, comprehension, the lower computational burden, and the short texts at hand. StockTwits, like Twitter, limits message length to 140 characters, which further limits the power of a machine-learning-based approach given the little contextual information in short texts. A dictionary, or lexicon, is a list of words labeled as positive, negative, or neutral. Given such a list, the bag-of-words approach consists of counting the number of positive and negative words in a document in order to assign it a sentiment value or tone. For example, a simple dictionary containing only the words 'good' and 'bad' with positive and negative labels, respectively, would classify the sentence 'Bitcoin is a good investment' as positive with a tone of +1. The simplicity of the dictionary-based approach guarantees transparency and replicability; on the downside, it comes with limitations on natural language analysis. First, referring to what Deng et al. (2017) call the 'context of discourse,' one needs to be aware of the content domain, to which language interpretation is sensitive. For example, Loughran and McDonald (2011) point out that words like 'tax' or 'cost' are classified as negative by the Harvard General Inquirer lexicon, whereas they should be considered neutral in the financial context. Another example concerns quantifying sentiment on cryptocurrencies. Chen et al. (2019a) point out that many domain-specific terms, such as 'blockchain,' 'ICO,' 'hackers,' 'wallet,' 'binance,' and 'hodl,' are not covered in the existing financial and psychological dictionaries. They construct a new cryptocurrency lexicon in response to the need for a specific approach to measuring sentiment about cryptocurrencies. The second limitation concerns the language domain, which Deng et al. (2017) define as the 'lexical and syntactical choices of language.' One example is the difference between newspapers, where one mostly finds a formal and standardized tone, and social media, where slang and emojis prevail. As observed, online investors often use new emojis to express positive or negative sentiment when talking about cryptocurrencies. These are missing in traditional dictionaries.
Bearing the aforementioned considerations in mind, for the sentiment quantification of the AAPL messages we employ the social media lexicon developed by Renault (2017), while in the case of BTC we advocate the lexicon tailored for cryptocurrency assets by Chen et al. (2019a). Renault (2017) demonstrates that the constructed lexicon significantly outperforms the benchmark dictionaries while remaining competitive with high-level machine learning algorithms. Based on 125,000 bullish and another 125,000 bearish messages published on StockTwits, the lexicon for social media classifies 90% of messages, with 75.24% of them correctly classified. 1 With a collection of 1,533,975 messages from 38,812 distinct users, posted between March 2013 and December 2018, and related to 465 cryptocurrencies listed on StockTwits 2 , Chen et al. (2019a) document that the crypto lexicon classifies 83% of messages, with 86% of them correctly classified.
To convert unstructured text into machine-readable form, we proceed with natural language processing (NLP) using the NLTK toolkit. First, all messages are lowercased. Tickers ('$BTC.X,' '$LTC.X,' ...), dollar or euro values, hyperlinks, numbers, and mentions of users are respectively replaced by the words 'cashtag,' 'moneytag,' 'linktag,' 'numbertag,' and 'usertag.' The prefix 'negtag' is added to any word following 'not,' 'no,' 'none,' 'neither,' 'never,' or 'nobody.' Finally, the three stopwords 'the,' 'a,' 'an' and all punctuation except the characters '?' and '!' are removed. For each collected message, we filter the terms appearing in the designated lexicon and weight the filtered terms equally to generate the sentiment score of the message; that is, the sentiment score of a message is the average of the weights of the lexicon terms it contains. Since the weights of the lexicon terms lie between −1 and +1, the sentiment scores fall in this range.
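The preprocessing steps above can be sketched as follows. This is a minimal illustration with plain regular expressions rather than the NLTK pipeline, and the tiny lexicon (including the 'negtag_good' entry and its weight) is a hypothetical stand-in for the Renault (2017) and Chen et al. (2019a) dictionaries:

```python
import re

# Hypothetical mini-lexicon; weights in [-1, +1] as described in the text.
LEXICON = {"bullish": 1.0, "bearish": -1.0, "negtag_good": -0.8, "good": 0.8}

def preprocess(msg):
    msg = msg.lower()
    msg = re.sub(r"\$[a-z]\w*(\.\w+)?", "cashtag", msg)   # tickers
    msg = re.sub(r"https?://\S+", "linktag", msg)         # hyperlinks
    msg = re.sub(r"[$€]\d+(\.\d+)?", "moneytag", msg)     # dollar/euro values
    msg = re.sub(r"\d+(\.\d+)?", "numbertag", msg)        # numbers
    msg = re.sub(r"@\w+", "usertag", msg)                 # user mentions
    tokens = re.findall(r"[\w?!]+", msg)
    out, negate = [], False
    for w in tokens:
        if negate:
            out.append("negtag_" + w)                     # negation prefix
            negate = False
        elif w in {"not", "no", "none", "neither", "never", "nobody"}:
            negate = True
        elif w not in {"the", "a", "an"}:                 # stopword removal
            out.append(w)
    return out

def sentiment(msg):
    # Average of the lexicon weights of the terms the message contains.
    weights = [LEXICON[w] for w in preprocess(msg) if w in LEXICON]
    return sum(weights) / len(weights) if weights else 0.0

print(sentiment("$BTC.X is not good, very bearish"))  # (-0.8 - 1.0) / 2 = -0.9
```

The equal weighting of matched terms mirrors the averaging rule described above; messages with no lexicon term receive a neutral score here by convention.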
To visualize the resulting sentiment scores of individuals over time, we select the top 100 active users and display their daily sentiment scores. The heatmap shown in Figure 2.1 is a two-dimensional matrix with the y-axis for the user's ID and the x-axis for the message posting date; each cell of the heatmap is the quantified sentiment score. The level of sentiment is color-coded, so that the evolution and dynamics of sentiment among users can be read from the heatmap. It appears that users express diverging opinions over time. From Figure 2.1a (AAPL) or Figure 2.1b (BTC), one observes similar color codes among a group of users at a particular date or period, indicating a contemporaneous and potentially intertemporal dependency among users' sentiment time series. The correlation matrices of users' sentiment time series in Figures 2.2a and 2.2b exhibit an interdependence of sentiment series. In most cases, we observe positive dependencies, and the dependencies seem to be centered on a group of users. Those who exhibit negative dependence with others may be classified as contrarians, a type of investor whose purchasing and selling decisions run contrary to the prevailing sentiment.
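The user-by-day panel behind such a heatmap and its correlation matrix can be assembled as in the following sketch, on synthetic scores rather than StockTwits data; the shared sentiment component and its weights are assumptions chosen to mimic the positive cross-user dependence observed above:

```python
import numpy as np

# Synthetic user-by-day sentiment matrix (scores in [-1, 1]), built from a
# common sentiment component plus idiosyncratic noise -- an assumption used
# to illustrate positive cross-user dependence, not real StockTwits data.
rng = np.random.default_rng(4)
users, days = 5, 60
common = rng.uniform(-1, 1, size=days)                 # shared component
panel = 0.7 * common + 0.3 * rng.uniform(-1, 1, size=(users, days))

corr = np.corrcoef(panel)          # user-by-user correlation matrix
print(corr.shape)                  # (5, 5)
```

Rows of `panel` correspond to users and columns to posting dates, exactly the layout of the heatmap; `np.corrcoef` over the rows yields the user-level correlation matrix analogous to Figures 2.2a and 2.2b.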
By aggregating the individual sentiment scores of 26K users for AAPL and 25K users for BTC, for each symbol we construct a daily aggregate sentiment indicator by averaging, at a 24-hour interval, the sentiment scores of the individual messages published per calendar day. Figure 2.3 displays the dynamics of aggregate sentiment on AAPL and BTC. Such sentiment dynamics may feature a hidden community structure and perhaps be driven by a small subset of users. For the sake of brevity, in Table 1 we only report the summary descriptive statistics for the aggregate sentiment indicator. In the case of AAPL (BTC), the mean and the standard deviation of the sentiment indicator are 0.285 (0.292) and 0.478 (0.397), respectively. Again, this shows that the sentiment on social media is quite positive. Compared to the sentiment on AAPL, the sentiment on BTC is more exuberant and relatively volatile.

Notation

If A is an N × N matrix and Λ_1, Λ_2 ⊂ [N] are two subsets of indices, we denote the submatrix A_{Λ_1,Λ_2} = (A_ij)_{i∈Λ_1, j∈Λ_2}. We also write for short A_{Λ,·} = A_{Λ,[N]} and A_{·,Λ} = A_{[N],Λ}. Furthermore, for a vector a ∈ R^d, denote by diag{a} ∈ R^{d×d} the square matrix that has the values a_1, . . . , a_d on the diagonal and zeros elsewhere. For a square matrix A ∈ R^{d×d}, we denote by Diag(A) ∈ R^{d×d} the diagonal matrix of the same size that coincides with A on the diagonal, i.e., Diag(A) = diag{A_11, . . . , A_dd}. For the off-diagonal part we use the notation Off(A) = A − Diag(A).
For a real vector x ∈ R^d and q ≥ 1 or q = ∞, denote the ℓ_q-norm ||x||_q = (|x_1|^q + · · · + |x_d|^q)^{1/q}; for q = 2 we drop the index, i.e., ||x|| = ||x||_2; we also denote the pseudo-norm ||x||_0, the number of nonzero entries of x. For a matrix A ∈ R^{d_1×d_2}, let σ_1(A) ≥ · · · ≥ σ_{min(d_1,d_2)}(A) denote the non-trivial singular values of A. We will also refer to σ_min(A) as the least non-trivial singular value, i.e., σ_min(A) = σ_{min(d_1,d_2)}(A). Furthermore, we write |||A|||_op = max_j σ_j(A) for the spectral norm and |||A|||_F = Tr^{1/2}(A^⊤A) = (Σ_j σ_j(A)²)^{1/2} for the Frobenius norm. Additionally, we introduce element-wise norms ||A||_{p,q} for p, q ≥ 1 (including ∞), where ||A||_{p,q} denotes the ℓ_q-norm of the vector composed of the ℓ_p-norms of the rows of A.

The structure of operator Θ: Influencers & communities
In our set-up, the behavior of each node i ∈ [N] is characterized by the coefficients Θ_i1, . . . , Θ_iN, and when we group the nodes by these characteristics, the notion of community merges with the notion of cluster. We assume that the nodes are separated into clusters such that these coefficients remain quantitatively comparable for the nodes within each cluster. Let us first give a precise definition of a clustering.
Definition 3.1. A K-clustering of the set of nodes [N] is a sequence C = (C_1, . . . , C_K) of K subsets of [N] such that
• any two subsets are disjoint, C_i ∩ C_j = ∅ for i ≠ j;
• the union of the subsets gives all nodes, C_1 ∪ · · · ∪ C_K = [N].
Two clusterings C and C′ are equivalent if the corresponding clusters are equal up to a relabeling, i.e., there is a permutation π on {1, . . . , K} such that C_j = C′_{π(j)} for every j = 1, . . . , K. Furthermore, define the distance between two clusterings as d(C, C′) = min_π Σ_{j=1}^K |C_j \ C′_{π(j)}|, where the minimum is taken over all permutations π on {1, . . . , K}.
Remark 3.1. The distance between clusterings is, in fact, the minimal number of node transfers from one cluster to another required to make the clusterings equivalent. To see this, notice that each clustering can be defined by a sequence (l_1, . . . , l_N) of N labels taking values in {1, . . . , K}, so that each cluster is C_j = {i : l_i = j}. Then, if the clustering C corresponds to the labels l_1, . . . , l_N and C′ to the labels l′_1, . . . , l′_N, the distance between them equals d(C, C′) = min_π #{i : l_i ≠ π(l′_i)}.
We specify our model by imposing assumptions concerning the communities and the presence of influencers.
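For small K, the clustering distance in Definition 3.1 can be computed by brute force over all relabelings, as in this sketch (the label sequences are toy inputs):

```python
from itertools import permutations

# Distance between two clusterings given as label sequences: the minimal
# number of label mismatches, minimized over all relabelings of 1..K.
# Brute force over permutations; feasible only for small K.
def clustering_distance(labels_a, labels_b, K):
    return min(
        sum(la != perm[lb] for la, lb in zip(labels_a, labels_b))
        for perm in permutations(range(K))
    )

a = [0, 0, 1, 1, 1]
b = [1, 1, 0, 0, 1]    # same partition up to a label swap, except the last node
print(clustering_distance(a, b, K=2))  # 1
```

Here the two clusterings differ by one node transfer once the labels are swapped, matching the interpretation in Remark 3.1.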
Definition 3.2. We say that Θ ∈ SONIC(s, K) (SOcial Network with Influencers and Communities) if
• each user is influenced by at most s influencers, i.e., ||Θ_{i·}||_0 ≤ s for every i ∈ [N];
• there is a K-clustering C = (C_1, . . . , C_K) such that Θ_{i·} = Θ_{i′·} whenever i, i′ are from the same cluster C_l, l = 1, . . . , K.
We will also say that Θ has clustering C.
Once Θ ∈ SONIC(s, K) has clustering C = (C_1, . . . , C_K), the following factor representation takes place: Θ = Z_C V^⊤, (3.1) where Z_C, V are N × K matrices such that the l-th column of Z_C equals |C_l|^{−1/2}(1(1 ∈ C_l), . . . , 1(N ∈ C_l))^⊤ ∈ R^N, a normalized index vector for the cluster C_l, so that Z_C^⊤ Z_C = I_K; and V is sparse, i.e., only a few of its rows are active and carrying information, corresponding to the influencers. We present a schematic picture of what we expect in Figure 3.1. Here, the nodes from the same clusters are subject to the same influencers (the grey nodes may be in any of the clusters), which also coincides with the idea of Rohe et al. (2016), who look for the right-hand side singular vectors of the Laplacian in a directed network, grouping the nodes affected by the same group of nodes.
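The factorization Θ = Z_C V^⊤ in (3.1) can be checked numerically on a toy example (the labels and loadings below are illustrative assumptions): columns of Z_C are normalized cluster indicators, so Z_C^⊤ Z_C = I_K, and rows of Θ coincide within clusters.

```python
import numpy as np

# Toy factorization Theta = Z_C V^T: Z_C holds normalized cluster
# indicators; V holds the (sparse) influencer loadings. All values assumed.
labels = np.array([0, 0, 1, 1, 1])         # 2-clustering of N = 5 nodes
N, K = len(labels), 2
Z = np.zeros((N, K))
for j in range(K):
    members = labels == j
    Z[members, j] = 1.0 / np.sqrt(members.sum())   # normalized indicator

assert np.allclose(Z.T @ Z, np.eye(K))             # orthonormal columns
V = np.zeros((N, K))
V[0] = [0.5, 0.0]                                  # node 0 influences cluster 0
Theta = Z @ V.T
print(np.allclose(Theta[0], Theta[1]))             # True: same-cluster rows match
```

Only row 0 of V is active here, so node 0 is the single influencer; the equality of rows within a cluster is exactly the community condition of Definition 3.2.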
The equation (3.1) is akin to bilinear factor models, which appear in the econometric literature as models with factor loadings, see e.g., Moon and Weidner (2018) and the references therein. It is also a popular machine learning technique for low-rank approximation, see a thorough review in Udell et al. (2016). Chen and Schienle (2019) use sparse factors for a closely related model. We also mention the line of work (Kapetanios et al., 2019; Parker and Sul, 2016; Pesaran and Yang, 2020) with a similar notion of dominant units, although, in contrast with our analysis, these are defined through modeling cross-sectional dependencies.

Missing observations
Recall the model (1.1), where Y_it is the response of a node i = 1, . . . , N at a time t = 1, . . . , T, now contaminated with missing observations. Instead of specifying the exact distribution under the parametric model (1.1), we assume there is a true parameter Θ* ∈ R^{N×N} and some unknown probability measure P with expectation E, such that under this measure the time series follows the autoregressive equation Y_t = Θ*Y_{t−1} + W_t, E[W_t | F_{t−1}] = 0. (3.2) For the sake of simplicity, we additionally assume that the W_t are independent and have Var(W_t) = S under P. Once |||Θ*|||_op < 1, the process exists as a converging series and the covariance of the process reads Σ = Var(Y_t) = Σ_{k≥0} (Θ*)^k S ((Θ*)^⊤)^k. For simplicity, we consider sub-Gaussian vectors W_t, as this allows us to have deviation bounds for covariance estimation with exponential probabilities. Recall the following definition, which appears, e.g., in Vershynin (2018).
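The series representation of the stationary covariance can be verified numerically on toy values (the operator below is a random illustrative choice with |||Θ|||_op < 1): the truncated series solves the Lyapunov fixed-point equation Σ = ΘΣΘ^⊤ + S.

```python
import numpy as np

# Numerical check (toy values): when the spectral radius of Theta is below 1,
# the series sum_k Theta^k S (Theta^T)^k converges to the stationary
# covariance, i.e., it satisfies Sigma = Theta Sigma Theta^T + S.
rng = np.random.default_rng(2)
N = 4
Theta = 0.3 * rng.standard_normal((N, N)) / np.sqrt(N)
assert np.max(np.abs(np.linalg.eigvals(Theta))) < 1
S = np.eye(N)

Sigma, term = np.zeros((N, N)), S.copy()
for _ in range(200):                 # truncate the converging series
    Sigma += term
    term = Theta @ term @ Theta.T

print(np.allclose(Sigma, Theta @ Sigma @ Theta.T + S))  # True
```

The same fixed-point identity underlies Assumption 3 on the population covariance later in the text.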
Definition 3.3. A random vector W ∈ R^N is called L-sub-Gaussian if for every v ∈ R^N it holds that ||⟨W, v⟩||_{ψ₂} ≤ L E^{1/2}⟨W, v⟩², where for a random variable X ∈ R we denote ||X||_{ψ₂} = inf{u > 0 : E exp(X²/u²) ≤ 2}.
Estimating SONIC is not impeded by the presence of missing data, which appears to be one of the features of social media data. We adopt the framework of Lounici (2014) for vectors with missing observations, assuming that each variable Y_it is only partially observed, independently, with some probability. Formally speaking, instead of having a realization of the whole vector Y_t, we only observe the masked process Z_t defined as Z_it = δ_it Y_it, (3.5) where δ_it ∼ Be(p_i) are independent Bernoulli random variables for every i = 1, . . . , N and some p_i ∈ (0, 1]. This means that each variable Y_it is only observed with probability p_i, independently from the other variables, with δ_it = 1 corresponding to an observed Y_it and δ_it = 0 to an unobserved Y_it. Obviously, the case p_i = 1 for every i = 1, . . . , N corresponds to the process without missing observations. The framework constituted by (3.5) therefore serves as a generalization of dynamic network models.
Remark 3.2. In terms of the StockTwits world, we interpret the process Y_t as an unobserved underlying opinion process. Such an opinion process, quantified from the messages, is subject to the random arrival of messages, as users disclose their opinions on social media at random times. Although one may restrict the sample to the case of full observation, the statistical inference may then be questionable. Also, discarding nodes with only a few missing observations is a waste of available information. Given that some users are more active than others, we need to account for different probabilities p_i.
Notice that in general the probabilities p_i are not known, but they can easily be estimated through the frequencies p̂_i = T^{−1} Σ_{t=1}^T 1[Z_it ≠ 0]. Set p̂ = (p̂_1, . . . , p̂_N)^⊤. Following Lounici (2014), we denote the observed empirical covariance Σ* = T^{−1} Σ_{t=1}^T Z_t Z_t^⊤ and consider the following covariance estimator: Σ̂ = diag{p̂}^{−1} Diag(Σ*) + diag{p̂}^{−1} Off(Σ*) diag{p̂}^{−1}. This estimator is motivated by the fact that EΣ*_ii = p_i Σ_ii and EΣ*_ij = p_i p_j Σ_ij for i ≠ j in the case of independent observations. The state-of-the-art bound for the error of such a covariance estimator is inspired by Klochkov and Zhivotovskiy (2020), Theorem 4.2. In the case of independent vectors Y_t and equal probabilities of observation p_1 = · · · = p_N = p, they show that for any u ≥ 1, with probability at least 1 − e^{−u}, the error |||Σ̂ − Σ|||_op is bounded in terms of the effective rank r̃(Σ) = Tr(Σ)/|||Σ|||_op of the covariance Σ. Similarly, the effective rank appears in the classic covariance estimation problem (i.e., p = 1); see, e.g., Koltchinskii and Lounici (2017), who even provide a matching lower bound. Notice that the effective rank takes values between 1 and the rank of Σ. However, if there is no specific restriction on the spectrum of Σ, the effective rank can grow as large as the full dimension N, which means that the bound above can only guarantee an error of order √(N/(T p²)), not taking the logarithms into account. On the other hand, one often only needs to bound the error within specific low-dimensional subspaces. Say, given two projectors P, Q of rank lower than N, one needs to bound the error |||P(Σ̂ − Σ)Q|||_op, which can be significantly smaller than the total error |||Σ̂ − Σ|||_op. For example, if we are interested in the error of estimation of Σ_{Λ,Λ}, where Λ ⊂ [N], the corresponding projectors have the form P = Q = Σ_{i∈Λ} e_i e_i^⊤. Notice that this projector is sparse, in the sense that most of its values are zeros, when |Λ| is much smaller than N.
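The Lounici (2014)-style correction can be sketched as follows, on synthetic data with identity covariance (the observation probabilities are assumed values): diagonal entries of the observed covariance are rescaled by 1/p̂_i and off-diagonal entries by 1/(p̂_i p̂_j).

```python
import numpy as np

# Sketch of the covariance estimator under missing observations:
# Sigma_hat[i,i] = Sigma_star[i,i] / p_i,  Sigma_hat[i,j] = Sigma_star[i,j] / (p_i p_j),
# with the probabilities estimated by observation frequencies. Toy data.
rng = np.random.default_rng(3)
N, T = 5, 20000
p = np.array([0.9, 0.8, 0.7, 0.9, 0.6])          # assumed probabilities
Y = rng.standard_normal((T, N))                  # true Sigma = identity
delta = rng.random((T, N)) < p                   # Bernoulli masks
Z = Y * delta                                    # masked process Z_it

p_hat = (Z != 0).mean(axis=0)                    # frequency estimates
Sigma_star = Z.T @ Z / T                         # observed covariance
scale = np.outer(p_hat, p_hat)
np.fill_diagonal(scale, p_hat)                   # only one 1/p_i on the diagonal
Sigma_hat = Sigma_star / scale

print(np.allclose(Sigma_hat, np.eye(N), atol=0.1))  # True: close to the truth
```

Without the rescaling, the naive covariance would be biased toward zero by the factors p_i and p_i p_j, exactly as the expectations above indicate.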
In fact, due to the unknown probabilities p_i, which we estimate via the frequencies, the "sparsity" of the projectors P, Q will play an important role as well. We define it below.
Definition 3.4. Let P ∈ R^{N×N} be a symmetric projector, i.e., P² = P. Let Λ ⊂ [N] be the smallest set such that P_ij is nonzero only for indices i, j ∈ Λ. Then we refer to the value |Λ| as the sparsity of P.
Remark 3.3. We employ this technical condition to state bounds for the error of the covariance estimator with missing observations. The corresponding diagonal projector Π_Λ = Σ_{i∈Λ} e_i e_i^⊤ commutes not only with P, but also with any other diagonal operator, in particular with diag{p̂}^{−1}. Thus, with the help of this larger projector (obviously, Rank(P) ≤ |Λ|) we can take into account the error that comes from the estimated frequencies.
The following theorem provides a deviation bound for the autoregressive process (3.2). Unlike the bound of Klochkov and Zhivotovskiy (2020), it accounts for possibly distinct probabilities p i .
Theorem 3.5. Assume the vectors W t are independent L-sub-Gaussian and also Let P, Q ∈ R N ×N be two arbitrary orthogonal projectors of ranks M 1 , M 2 and with sparsities K 1 , K 2 respectively. Suppose that u > 0 is such that Then, it holds with probability at least 1 − e −u that where C = C(γ, L) only depends on L and γ.
See the proof of this result in Section A. Additionally, we are interested in estimating the lag-1 cross-covariance under the same scenario. Namely, based on the sample Z_1, . . . , Z_T and given the estimated probabilities p̂_1, . . . , p̂_N, consider the following estimator: Â = diag{p̂}^{−1} A* diag{p̂}^{−1}, where A* = T^{−1} Σ_{t=2}^T Z_{t−1} Z_t^⊤ is the observed empirical cross-covariance. For this estimator, we provide an upper bound, again with a restriction to some low-dimensional subspaces.
Theorem 3.6. Under conditions of Theorem 3.5, it holds, with probability at least 1−e −u , that where C = C(γ, L) only depends on γ and L.
We postpone the proof to Section A.

Alternating minimization algorithm
In order to estimate the matrix Θ = Z_C V^⊤, we need to estimate both C and V simultaneously. Suppose that we have some clustering C at hand and aim to estimate the corresponding V. The mean squared loss from the fully observed sample is L(C, V) = T^{−1} Σ_t ||Y_t − Z_C V^⊤ Y_{t−1}||² = Tr(V^⊤Σ̃V) − 2 Tr(V^⊤ÃZ_C) + T^{−1} Σ_t ||Y_t||², (3.7) where we used the fact that Z_C^⊤ Z_C = I_K and that the trace of a matrix product is invariant with respect to transposition of the factors, Tr(AB) = Tr(BA). Here, we also denote Σ̃ = T^{−1} Σ_t Y_{t−1} Y_{t−1}^⊤ and Ã = T^{−1} Σ_t Y_{t−1} Y_t^⊤, the empirical covariance and empirical lag-1 cross-covariance built on a sample Y_1, . . . , Y_T, which we only observe partially. In practice, the feasible estimators are Σ̂ and Â, which we introduced in the previous section. A natural solution is to plug these estimators into the expression (3.7) in place of the unobserved Σ̃ and Ã. The last term T^{−1} Σ_t ||Y_t||² does not depend on the parameters C and V at all; therefore, we can drop it. We end up with the risk function F(C, V) = Tr(V^⊤Σ̂V) − 2 Tr(V^⊤ÂZ_C) that we need to minimize. In particular, it is not hard to derive from Theorems 3.5 and 3.6 that for any fixed pair (C, V), the feasible risk and the full-sample loss are close with high probability.
As we are searching for a sparse matrix V, we additionally impose a LASSO regularization and end up with the following convex optimization problem: V̂_{C,λ} = argmin_{V∈R^{N×K}} {Tr(V^⊤Σ̂V) − 2 Tr(V^⊤ÂZ_C) + λ||V||_{1,1}}, (3.8) where ||V||_{1,1} = Σ_ij |v_ij|, and the tuning parameter λ > 0 depends on the dimension N and the number of observations T. Concerning this minimization problem, we make the following observations:
• the problem reduces to simple quadratic programming and can therefore be solved efficiently;
• the objective decomposes over the columns v_1, . . . , v_K of V: writing Z_C = (z_1, . . . , z_K), each column solves a separate LASSO problem of the form min_{v∈R^N} {v^⊤Σ̂v − 2v^⊤Âz_k + λ||v||_1}. Therefore, we need to solve K independent problems of size N, which reduces the computational complexity, and the problems may be solved in parallel.
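Each per-cluster problem can be solved by standard proximal gradient (ISTA) iterations, as in the following sketch; the matrices Σ̂ and the vector b_k = Âz_k are replaced here by toy values, and the step size and iteration count are illustrative choices:

```python
import numpy as np

# Minimal ISTA sketch for one per-cluster LASSO problem:
#   min_v  v' Sigma v - 2 v' b + lam * ||v||_1,
# where b stands for A_hat z_k. Toy inputs; not the paper's implementation.
def soft_threshold(x, tau):
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def lasso_column(Sigma, b, lam, n_iter=500):
    step = 1.0 / (2.0 * np.linalg.norm(Sigma, 2))   # 1 / Lipschitz constant
    v = np.zeros(len(b))
    for _ in range(n_iter):
        grad = 2.0 * (Sigma @ v - b)                # gradient of smooth part
        v = soft_threshold(v - step * grad, step * lam)
    return v

N = 20
Sigma = np.eye(N)                 # toy covariance estimate
b = np.zeros(N); b[3] = 1.0       # one active influencer (assumed)
v = lasso_column(Sigma, b, lam=0.5)
print(np.flatnonzero(v))          # [3]
```

With an identity Σ̂ the solution is the soft-thresholded target, here v_3 = 0.75 and zeros elsewhere, illustrating how the ℓ₁ penalty switches off non-influencer rows.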
Ideally, we want to solve the following problem (note that the number of clusters K and the tuning parameter λ are fixed): Ĉ = argmin_C F_λ(C), where F_λ(C) denotes the minimal value of the penalized risk over V. We can employ a simple greedy procedure. In the beginning, we initialize C^{(0)} = (l_1, . . . , l_N) randomly, with each label taking values in 1, . . . , K. Then, at step t, we change the one label of a node that reduces the risk the most; in other words, we try all the clusterings in the nearest vicinity of the current solution C^{(t)}, i.e., those with d(C, C^{(t)}) ≤ 1. At each such step, we need to calculate F_λ(C) for O{N(K−1)} different candidates.
Remark 3.4. In general, it is impossible to optimize an arbitrary function f(C) with respect to a clustering. The K-means problem is well known to be NP-hard; nevertheless, approximate solutions are widely used in practice, see Shindler et al. (2011) and Likas et al. (2003).
To speed up the greedy procedure, we utilize an alternating minimization strategy. In the beginning, we initialize the clustering by C^(0) and compute the LASSO solution V^(0) = V_{C^(0),λ}. When updating the clustering, we fix the matrix V = V^(t) and solve the problem in which only the term −Tr(V⊤ Â Z_C) depends on C. Minimizing this term by conducting a few steps of the greedy procedure, we obtain the next clustering update C^(t+1). Then, we again update the V-factor by setting V^(t+1) = V_{C^(t+1),λ}. We continue until the clustering no longer changes or the number of iterations exceeds a specified limit. The pseudo-code in Algorithm 1 summarizes this procedure.
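A simplified illustration of the clustering update (our own code, not the paper's implementation; we take the fit term as −Tr(V⊤ÂZ_C), the exact transpose orientation being our assumption):

```python
import numpy as np

def Z_of(labels, K):
    """Normalized membership matrix Z_C with columns 1_{C_j}/sqrt(n_j)."""
    N = len(labels)
    Z = np.zeros((N, K))
    for j in range(K):
        idx = np.flatnonzero(labels == j)
        if idx.size:
            Z[idx, j] = 1.0 / np.sqrt(idx.size)
    return Z

def fit_term(labels, K, V, A_hat):
    """-Tr(V' A_hat Z_C): the only part of the risk that depends on C."""
    return -np.trace(V.T @ A_hat @ Z_of(labels, K))

def greedy_pass(labels, K, V, A_hat):
    """Greedy sweeps: apply the single-label change that reduces the
    fit term the most, until no single move improves it."""
    labels = labels.copy()
    improved = True
    while improved:
        improved = False
        best = fit_term(labels, K, V, A_hat)
        for i in range(len(labels)):
            for k in range(K):
                if k == labels[i]:
                    continue
                trial = labels.copy()
                trial[i] = k
                val = fit_term(trial, K, V, A_hat)
                if val < best - 1e-12:
                    labels, best, improved = trial, val, True
    return labels
```

In the full alternating scheme, each `greedy_pass` would be followed by refitting V by the per-cluster LASSO for the updated clustering.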

Local consistency result
In this section, we show the existence of a locally optimal solution in the neighborhood of the true parameter with high probability. We call a clustering solution Ĉ locally optimal if the functional F_λ(·) in (3.8) attains its minimum value at the point Ĉ among its nearest neighbours d(C, Ĉ) ≤ 1. In particular, Algorithm 1 stops at such a solution.

Conditions
Here we describe the conditions that we need for the consistency result. The first condition collects the requirements of Theorems 3.5 and 3.6.
Furthermore, we impose assumptions on the structure of the true parameter Θ * described in Section 3.2.
Assumption 2. The true VAR operator admits the decomposition with a K-clustering C* and meets the following conditions: for some a_0 > 0; 3. sparsity: for every j = 1, . . . , K the active set. Here each ‖v*_j‖ ≤ 1 has at most s nonzero values, hence the normalization; 5. significant cluster sizes: for some α ∈ (0, 1) it holds. Notice that condition (3.9) corresponds to an appropriate separation of the clusters, i.e., each v*_j is far enough from any linear combination of the rest. The next assumption imposes conditions on the population covariance Σ.
Assumption 3. The covariance of Y_t reads as above, where S = Var(W_t), and it is assumed that the stated eigenvalue bounds hold. Note that we do not require the smallest eigenvalue of Σ to be bounded away from zero, but only those of the submatrices corresponding to small subsets of indices. Such an assumption is not too restrictive. In fact, Σ_{Λ_j,Λ_j}^{-1} would correspond to the Fisher information if we were estimating the vector v_j knowing the cluster C*_j and the sparsity pattern Λ_j in advance.
For the sake of simplicity, we additionally assume that the ratio is bounded by some constant κ ≥ 1. Additionally, we treat the values L, γ, a_0, τ_0, and α as constants. Below we focus on the extent to which the relationship between N, T, s, K and the observation probabilities p_i, i = 1, . . . , N allows consistent estimation of the parameter Θ. Finally, we present the assumption that allows us to control the exact recovery of the sparsity patterns for the LASSO estimator.
Assumption 4. For every j = 1, . . . , K it holds. Remark 3.5. Zhao and Yu (2006) call the inequality ‖Σ_{Λ_j^c,Λ_j} Σ_{Λ_j,Λ_j}^{-1}‖_{1,∞} < η with a constant η ∈ (0, 1) the strong Irrepresentable Condition. To avoid technical burden, we pick the concrete constant η = 1/4. In the special case with fixed design and no noise, Tropp (2006) shows that the inequality ‖Σ_{Λ_j^c,Λ_j} Σ_{Λ_j,Λ_j}^{-1}‖_{1,∞} < 1 is necessary in order to recover the sparsity pattern of v_j. In Section B, we show a straightforward extension of Tropp's sparsity recovery results to the case with random design and missing observations.
We are now ready to state our main theorem.
Remark 3.6. In the above theorem we only show the existence of a local minimum of the functional F_λ(C) defined in (3.8) near the true clustering C*, together with the statistical properties of the corresponding estimator Θ̂_λ. This is not uncommon in the machine learning literature when dealing with non-convex bilinear models, see e.g. Gribonval et al. (2015). In addition, we do not guarantee that the algorithm converges to the global minimum. Similarly, Chen et al. (2021) and Chen (2014) offer procedures that are only guaranteed to arrive at a local minimum, and suggest running the algorithm several times to ensure that the global solution is covered.
Let us discuss this result. According to the theorem, a greater λ gives a greater error once it lies in the required range. This comes naturally, as the result is based on exact recovery, see e.g. Tropp (2006). Ideally, we want to choose the smallest available value, given in (3.13). In this case, the error of the estimator reads as follows, where C does not depend on N, T, K, s. Notice that in a hypothetical situation where the clustering C* is known precisely, we would only need to estimate the matrix V, which consists of at most Ks non-zero parameters. Therefore, according to Lemma 7.7, the LASSO estimator must give us the corresponding bound, where we used the fact that Z_{C*} has orthonormal columns; see also Melnyk and Banerjee (2016) and Han et al. (2015). Loosely speaking, not knowing the exact clustering yields an estimator that is at most √K times worse. Let us take a closer look at condition (3.11). Under the cluster size restriction from Assumption 2, all clusters have size of order N/K. Therefore, if we ignore missing observations, we only need the following, with some constant c depending on α, to enable estimation of the parameters. So, once K is large enough, the estimator works with the corresponding error. Notice that ℓ1-regularisation alone requires the number of observations to be at least the number of edges times log N, see Fan et al. (2009). In our setting, the number of connections is up to Ns, hence such a condition reads accordingly. Therefore, the SONIC model is an improvement in this regard. Finally, we point out that the conditions of Theorem 3.7 imply some limitations on the size of the network relative to the number of observations. Indeed, using the first part of condition (3.11) and comparing the lower and upper bounds in condition (3.12), we can easily derive the following, where c > 0 is a constant that depends only on L, γ, a_0, τ_0, and α.
Though we do not claim that this condition is necessary, it is clear that in some cases estimation is possible even when N > T.

Simulation study
The theoretical properties of the SONIC model and the developed theorems can be further supported via simulation. We examine the stated theorems and properties via relative estimation errors and clustering errors, and discuss in particular the choice of the regularization parameter and the number of clusters before turning to the StockTwits applications.
We set up the simulations as follows. Take N = 100 and s = 1, while K varies over the range K = 5, 10, 15, 20, 25. For every such K we construct the matrix Θ* as follows: • pick clusters C*_j having approximately the same size N/K ± 1; • take each v*_j with a single nonzero value at the place j, so that s = 1.
• by construction we then have the stated properties of Θ*.
As for the sample size, we consider two scenarios: (a) T = 100 and p_i = 1, i.e., no missing observations; (b) T = 400 and p_i = 0.5, i.e., each Y_it is observed with probability 0.5.
In order to generate the autoregressive process, we take i.i.d. innovations W_t and use the truncated moving-average representation Y_t = Σ_{k=0}^{20} (Θ*)^k W_{t−k}, where due to 0.5^20 ≈ 10^{-6} the terms for k > 20 can be neglected. In Figure 4.1 we show the relative error E|||Θ̂ − Θ*|||_F/|||Θ*|||_F along the regularization paths for different choices of K. Picking the best λ, we show the relative error against the number of clusters in Figure 4.2. We also show in Figure 4.3 how the clustering error E d(Ĉ, C*) depends on the choice of K. All expectations are estimated from 20 independent simulations. Evidently, within the considered range of cluster numbers, larger ones lead to a smaller relative error as well as a smaller clustering error. The simulations partially confirm the discussion at the end of the previous section, namely that the conditions of Theorem 3.7 can be met when K is large enough, although not too large. In our case of missing observations, the value T must be replaced by T p²_min, the effective number of observations. Furthermore, Wang and Samworth (2018) recommend disregarding the multiplicative constants that appear in the theory in front of σ√(log N/(T p²_min)) (see equation (3.13)), since this leads to consistent, but rather conservative, estimation.
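The construction above can be sketched as follows (our own code; the nonzero value 0.5 for the entries of V* is our assumption, chosen so that |||Θ*|||_op = 0.5 < 1, consistent with the 0.5^20 truncation remark):

```python
import numpy as np

def make_theta(N=100, K=5, a=0.5):
    """Cluster-structured VAR operator Theta* = Z_C V'.

    Clusters of size ~N/K; each v*_j has a single nonzero entry
    (s = 1), scaled so the operator norm stays below one.
    """
    labels = np.arange(N) % K             # near-equal cluster sizes
    Z = np.zeros((N, K))
    for j in range(K):
        idx = np.flatnonzero(labels == j)
        Z[idx, j] = 1.0 / np.sqrt(idx.size)
    V = np.zeros((N, K))
    V[np.arange(K), np.arange(K)] = a     # single spike per column
    return Z @ V.T, labels

def simulate(Theta, T=100, p=1.0, trunc=20, rng=None):
    """Y_t = sum_{k<=trunc} Theta^k W_{t-k}; each entry kept w.p. p."""
    rng = np.random.default_rng(rng)
    N = Theta.shape[0]
    W = rng.standard_normal((T + trunc, N))
    powers = [np.linalg.matrix_power(Theta, k) for k in range(trunc + 1)]
    Y = np.stack([sum(powers[k] @ W[t + trunc - k] for k in range(trunc + 1))
                  for t in range(T)])
    mask = rng.random((T, N)) < p         # Bernoulli(p) observations
    return Y * mask, mask
```

Since Z has orthonormal columns, the singular values of Θ* equal those of V, so the operator norm is exactly a.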
The simulation results support this choice. Let us take a look at the regularisation paths in Figure 4.1 for different values of K. All of the graphs exhibit similar behavior: as λ increases, the evaluated expected relative loss drops until it reaches its minimum, and then it increases until it reaches the constant value corresponding to Θ̂_λ = 0, which happens once the regularization is large enough. Typically, the "oracle" choice corresponds to the minimizer of the expected loss E|||Θ̂_λ − Θ*|||_F. In order to compare it with the recommended choice above, for each K = 5, . . . , 25 we pick the tuning parameter (among the available choices on the graph) that minimizes the evaluated expected loss. In Figure 4.4 we show the values of the best λ for each K = 5, . . . , 25 (blue line) and compare them to the heuristic value √(log N/(T p²_min)) (red line). We observe that once the number of clusters is large enough (K ≥ 15), the corresponding optimal choice of λ approximately equals √(log N/(T p²_min)). On the other hand, as the graph in Figure 4.3 suggests, for K ≤ 10 the number of nodes assigned to a wrong cluster grows significantly, and one cannot estimate the model well with any choice of the regularization parameter.
Remark 4.1. In practice, one must evaluate the noise level σ in a data-driven way (Belloni and Chernozhukov, 2013). We suggest evaluating it using the spectrum of the covariance estimator Σ̂. One obvious choice is σ̂ = ‖Σ̂‖. However, this may lead to an overestimated noise level. We therefore suggest the following strategy. Since Σ = Θ* Σ (Θ*)⊤ + S, we expect the original covariance to have either K or K − 1 spikes (one cluster could be zero). In particular, this is true whenever S = σI. We therefore suggest using the singular value σ̂ = σ_K(Σ̂), which means that we skip the first K − 1 components. The resulting regularisation parameter reads as λ = σ̂ √(log N/(T p²_min)). In the next section, we stick to this strategy.
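A sketch of this rule (our own helper names; the λ formula follows the heuristic of (3.13) with the plug-in σ̂):

```python
import numpy as np

def noise_level(Sigma_hat, K):
    """Estimate sigma as the K-th largest singular value of Sigma_hat,
    skipping the K-1 spike components induced by the cluster structure."""
    sv = np.linalg.svd(Sigma_hat, compute_uv=False)   # sorted descending
    return sv[K - 1]

def tuning_lambda(Sigma_hat, K, N, T, p_min):
    """Heuristic lambda = sigma_hat * sqrt(log N / (T p_min^2))."""
    return noise_level(Sigma_hat, K) * np.sqrt(np.log(N) / (T * p_min**2))
```
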

Choice of number of clusters K via stability analysis
In the simulation study above, we fixed the number of clusters a priori. When applying SONIC to empirical data, this is rarely possible. One way to decide the number K is to analyze the stability of the clustering algorithm (Rakhlin and Caponnetto, 2007; Le Gouic and Paris, 2018). The idea is that if we guess the number of clusters correctly, then on different subsamples we should get similar results. On the other hand, if our guess is wrong, we can end up with randomly split or glued clusters; in other words, the resulting clustering will be unstable with respect to changes of the sample. We therefore propose the following procedure. Consider a sequence of intervals I_1, . . . , I_l ⊂ {1, . . . , T} of the same length, and estimate the clusterings Ĉ_{I_j} using the observations (Y_t)_{t∈I_j} for each j = 1, . . . , l. If the number of clusters is correct, we expect the pairwise distances between the Ĉ_{I_j} to be small. We take l = 6 intervals of length 3T/4 ± 1, of the form (4.1), so that we include all available observations. We then calculate the distances d(Ĉ_{I_1}, Ĉ_{I_j}) for each j = 2, . . . , l and for different choices of K. We suggest choosing the number of clusters that yields small distances d(Ĉ_{I_1}, Ĉ_{I_j}) compared to the total number of nodes in the network. We demonstrate how this can look in the following simulation scenarios. In Figure 4.5 we present the results obtained from one realisation of each scenario. Each panel (a)-(d) contains the corresponding scenario, with T = 100, 200, 500, 1000, 2000 marked with different colors from left to right. In Figure 4.5a, in the case where the true number of clusters is K = 2 and p_min = 1, at first we do not see any stability: although the clustering errors corresponding to the correct guess K = 2 may be smaller, they are still rather large compared to the total number of nodes.
Only for T = 2000 do the clustering distances become small (up to 4), and we can clearly see that there are only two clusters. Figure 4.5b shows the results for K = 2 and p_min = 0.5. Since the effective number of observations is T p²_min, the considered sample sizes are not sufficient in this case. Figure 4.5c shows the results for K = 5 and p_min = 1. Here, we see stable estimation of the clustering for T = 1000, 2000, which means that it requires half as many observations as the case K = 2. Notice that for T = 2000, the correct choice K = 5 shows the smallest distance between clusterings obtained from different windows. For K = 6 the distance is still rather small, but the choice is incorrect. Figure 4.5d shows the results for K = 5 and p_min = 0.5. Effectively, the number of observations is reduced by a factor of four, and we see a similarity between the graph for T = 2000 here and the one for T = 500 in Figure 4.5c, as well as some resemblance between T = 1000 in Figure 4.5d and T = 200 in Figure 4.5c. None of the graphs in Figure 4.5d demonstrates stability, due to the lack of simulated observations.
In conclusion, we suggest looking for the smallest number of clusters that shows a "reasonably" small clustering difference across the different windows, in the sense that it is much smaller than the total number of nodes. However, at this point we are not able to provide a statistical justification of what a "reasonable" clustering distance is. The stability analysis we suggest should therefore be used as a qualitative heuristic.
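The stability check requires a label-permutation-invariant clustering distance d(·, ·); a sketch using Hungarian matching (our own implementation; `fit_clustering` is a placeholder for the SONIC estimation routine):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def cluster_distance(c1, c2, K):
    """d(C, C'): number of nodes in disagreement after the best
    matching of cluster labels."""
    conf = np.zeros((K, K))
    for a, b in zip(c1, c2):
        conf[a, b] += 1
    rows, cols = linear_sum_assignment(-conf)   # maximize agreement
    return len(c1) - int(conf[rows, cols].sum())

def stability(fit_clustering, Y, mask, K, n_windows=6):
    """Fit clusterings on overlapping windows of length ~3T/4 and
    report distances to the first window's clustering.

    fit_clustering(Y, mask, K) -> label array is assumed given.
    """
    T = Y.shape[0]
    L = (3 * T) // 4
    starts = np.linspace(0, T - L, n_windows).astype(int)
    labs = [fit_clustering(Y[s:s + L], mask[s:s + L], K) for s in starts]
    return [cluster_distance(labs[0], l, K) for l in labs[1:]]
```

Small reported distances relative to N for a given K indicate a stable (and hence plausible) number of clusters.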

Application to StockTwits sentiment
Here we demonstrate the applicability of SONIC to the dataset described in Section 2. We look at the two networks (AAPL and BTC) comprising users' sentiment time series. These two symbols, representing the most popular security and cryptocurrency respectively, may reveal disparate characteristics and thereby distinct network dynamics, featuring different communities and influencers.
To ensure that the model is applicable to real-world data, we require that each user's observations persist with the same probability p_i over the considered time period. Moreover, since in Theorems 3.5 and 3.6 the effective number of observations scales with the factor p²_min, we need to exclude users whose p_i is too small. We propose the following sample selection criteria to account for missing observations.
1. pick users with estimated probability p̂_i ≥ 0.6 for BTC and p̂_i ≥ 0.8 for AAPL, reflecting the fact that the daily message volume of AAPL doubles that of BTC in Table 1; 2. select the longest historical interval over which the user exhibits a persistent probability of observation; one can look at a moving-average estimate and ensure that within every window it remains inside an appropriate confidence interval; 3. take only the users whose historical interval from step 2 is at least 50 weeks.
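These selection rules can be sketched as a filter on the observation mask (a hypothetical helper; the rolling-window length and band width are our assumptions, not values from the paper):

```python
import numpy as np

def select_users(mask, p_min=0.6, min_weeks=50, window=10, band=0.25):
    """Return indices of users passing the three selection rules.

    mask : (T, N) 0/1 observation indicators
    Rule 1: overall observation rate p_hat_i >= p_min.
    Rules 2-3: the longest stretch whose rolling observation rate stays
    within +/- band of p_hat_i must cover at least min_weeks periods.
    """
    T, N = mask.shape
    keep = []
    for i in range(N):
        p_hat = mask[:, i].mean()
        if p_hat < p_min:
            continue
        # rolling observation rate over `window` consecutive periods
        roll = np.convolve(mask[:, i], np.ones(window) / window, "valid")
        ok = np.abs(roll - p_hat) <= band
        run, best = 0, 0
        for flag in ok:                   # longest run of stable windows
            run = run + 1 if flag else 0
            best = max(best, run)
        if best + window - 1 >= min_weeks:
            keep.append(i)
    return keep
```
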
Equipped with these criteria, for the AAPL dataset we are left with 36 users and 82 weeks, while for BTC we have 53 users and 78 weeks. Note that, to mitigate missing observations and the presence of outliers and noise, we employ weekly sentiment series obtained by averaging the daily sentiment series.
We apply our SONIC model to the AAPL dataset. We set λ = 0.08 according to Section 4.1 and Remark 4.1. As for the number of clusters, we perform the analysis described in Section 4.2 and present the results in Figure 5.2a for K = 2, 3, 4, 5, 6. Based on these results, we pick K = 2, with a maximum clustering distance of 3 out of 36 users in total. We present a heatmap visualization of the estimated matrix Θ̂ in Figure 5.1a, where we identify the influencer candidates with the identification numbers 619769, 850976, 5, 962572, 526780, 473512. To compare our identification with conventional social-media indicators of influence, e.g. the number of followers, we analyze the social network profiles of the selected users, including the registration date, the number of followers, the number of ideas, the liked count, etc.
To retrieve users' social profiles, we use the StockTwits API toolkit to request the users' message streams and profiles. We stratify the retrieved data and focus in particular on the number of followers, the number of ideas, and the liked count, in the hope that these characteristics comply with the social consensus on the notion of an influencer. Table 2 summarizes the influencers' social profiles and reports the corresponding percentile ranks among a pool of users.
The identified users appear either to attract many followers or to behave very actively, posting a large number of ideas or accumulating a large liked count. The first three influencers represent trading companies offering technical and fundamental analysis for the symbols of interest. This suggests that investment companies and financial-industry entrepreneurs target their potential customers on social media and influence them strategically. The latter two are financial analysts or trading consultants, and they may serve a smaller group of users.
As for the BTC dataset, applying the proposed strategy we end up with λ = 0.21 and, using the results in Figure 5.2b, we choose K = 2. Figure 5.1b displays the estimated matrix Θ̂ and identifies the influencers 398367 and 969971. Likewise, we elicit their social profile data and document the relevant features in Table 2. The first is an investment company specializing in crypto assets, while the second is a crypto specialist updating price information and producing technical analytics for cryptocurrency traders. Both broadcast tactical trading information and update it frequently.
We note that relying solely on these social characteristics may oversimplify the task of identifying influencers. One should be aware that some users with many more followers or ideas may not surpass those identified by our approach in terms of the importance of their opinions. In the case of BTC, users who have specialized in crypto assets may serve a relatively small group of people with specific trading preferences, while not attracting a large following.

Prediction performance compared with other methods
To highlight the advantages of the proposed model, we compare the prediction accuracy of our method with several prevalent benchmarks, each of which accounts for missing observations:
• VAR with missing observations, where Σ̂ and Â are the covariance and cross-covariance estimators, respectively (recall the definitions from Section 3.3);
• LASSO VAR with missing observations, where we choose the same λ as in SONIC;
• the constant estimator Θ̂ = 0, which corresponds to no correlation across time.

Figure 5.2: Stability analysis for the AAPL and BTC datasets. For each K = 2, 3, 4, 5, 6 on the x-axis, we plot five points corresponding to the clustering distances d(Ĉ_{I_1}, Ĉ_{I_j}) on the y-axis, with sampling windows as in (4.1).

Table 2: We report the number of followers, the number of ideas, and the liked count tagged to each specific user ID. The value in parentheses is the corresponding percentile rank among a pool of users.
In our exercise, we split the available sample (82 weeks for AAPL and 78 weeks for BTC) into train and test subsamples, approximately 70% to 30%. When measuring the prediction error on data with missing observations, we run into the same problem as before. Ideally, we want to evaluate the quantity (1/(T_test − 1)) Σ_{t=2}^{T_test} E‖Y_t − Θ̂ Y_{t−1}‖², where T_test is the number of observations in the test sample and Θ̂ is estimated on the train sample. Observe that this expands similarly to (3.7), and we suggest replacing (1/(T_test − 1)) Σ_{t=2}^{T_test} Y_t Y_t⊤ and (1/(T_test − 1)) Σ_{t=2}^{T_test} Y_{t−1} Y_t⊤ with Σ̂_test and Â_test, respectively, the covariance and cross-covariance estimated from the test sample. To sum up, we evaluate the prediction performance by Tr(Σ̂_test) − 2 Tr(Θ̂ Â_test) + Tr(Θ̂ Σ̂_test Θ̂⊤).
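The evaluation criterion can be sketched as follows (our naming; the transpose convention for Â_test is our assumption):

```python
import numpy as np

def prediction_error(Theta, Sigma_test, A_test):
    """Tr(Sigma_test) - 2 Tr(Theta A_test) + Tr(Theta Sigma_test Theta').

    Sigma_test, A_test: covariance and lag-1 cross-covariance estimated
    on the test sample with the missing-data correction; Theta is the
    estimate obtained on the train sample.
    """
    return (np.trace(Sigma_test)
            - 2.0 * np.trace(Theta @ A_test)
            + np.trace(Theta @ Sigma_test @ Theta.T))
```

For Θ̂ = 0 the criterion reduces to Tr(Σ̂_test), the baseline of the no-correlation benchmark.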
The results are presented in Table 3. In terms of prediction performance, SONIC is slightly better than the sparse VAR for the AAPL dataset and as good as the sparse VAR for the BTC dataset. The regular VAR blows up in both cases, which is not surprising given the dimension and the sample sizes. The similarity of SONIC and the sparse VAR indicates that the number of clusters K = 2 is too small to benefit from our model in terms of performance; notice that condition (3.11) of Theorem 3.7 is likely to break for a small number of clusters. However, the fact that SONIC is not worse than the sparse VAR confirms that the proposed model indeed reflects the dynamics of a real sentiment-based network. In addition, we compare the results with the constant estimator Θ̂ = 0, which corresponds to the no-causality case. In both cases its loss is higher than that of the SONIC model.

Conclusion
Nowadays, interest in the dynamics of interaction among users of social media is growing dramatically. Social media have become an attractive venue where users can easily and instantly interact with one another. Research in this strand is, however, challenging. From an econometric point of view, these dynamics require effective state-of-the-art methodologies that cope with the curse of dimensionality and characterize psychological interdependence. From a quantitative perspective, textual analysis distills the text-based information from Twitter or StockTwits into a numerical expression of sentiment or opinion. The joint evolution of the sentiment variables of individuals constitutes a dynamic network of possibly growing dimension.
In order to cope with dimensionality in a limited-observation setting, we propose SONIC (SOcial Network analysis with Influencers and Communities). SONIC characterizes social network dynamics and interdependence through identified influencers and detectable communities. We provide and discuss several theoretical results on the asymptotic consistency of the dynamic network parameters, even when observations are missing. We propose an estimation procedure based on a greedy algorithm and LASSO regularization, which we extensively test in simulations.
Using StockTwits data and the lexicon-based sentiments, we deploy a SONIC analysis and display an opinion network for Apple and Bitcoin users (nodes). We detect K = 2 communities using stability analysis and identify the influencers subsequently. We discuss the choice of the regularization parameter λ of LASSO and the choice of the number of clusters.

Proof of the main result
This section is devoted to the proof of Theorem 3.7. We start with some preliminary lemmata and then proceed with the proof, which consists of several steps. Following the ideas in Gribonval et al. (2015), the proof relies on an explicit representation of the loss function.
We use the following simplified notation. Denote z*_j = z_{C*_j} to be the columns of Z* = Z_{C*}, and denote n*_j = |C*_j| for every j = 1, . . . , K. When the clustering C = (C_1, . . . , C_K) is clear from the context, we also write Z for Z_C, z_j for z_{C_j}, and n_j = |C_j| for every j = 1, . . . , K.

Preliminary lemmata
Proof. Suppose n_j = |C_j| > n*_j = |C*_j|. Then, since |C_j ∩ C*_j| ≤ n*_j, we obtain √n_j − √n*_j ≤ (r²/2)√n_j, which, due to r ≤ 0.3, implies n_j ≤ 1.1 n*_j after rearranging and squaring. If n_j < n*_j, the analogous bound together with r ≤ 0.3 implies n*_j ≤ 1.1 n_j.
Proof. 1) We first consider the case |C_j| = n*_j. It holds then, moreover, for every k ≠ j. Summing up, we get the claim. It is left to notice that in the case |C_j| = |C*_j| = n*_j we have exactly ‖z_j − z*_j‖² = (1/n*_j)|C_j △ C*_j|. 2) Suppose n_j = |C_j| > n*_j. Obviously, we can decompose C_j = C'_j ∪ B such that |C'_j| = n*_j and B ∩ C*_j = ∅. Setting z'_j = z_{C'_j}, we get by the above derivations the corresponding bound. Taking the remainder b = z_j − z'_j, we show that the latter is at most 2.05 α^{-1/2} r². Indeed, it is not hard to show from n_j ≤ 1.1 n*_j (see Lemma 7.1) that ‖[Z*]⊤(z_j − z*_j)‖_1 ≤ 3.05 α^{-1/2} r², and the result follows.
3) The case n_j < n*_j can be resolved similarly to the previous one. Since |C*_j \ C_j| ≥ n*_j − n_j, we can pick a subset B ⊂ C*_j \ C_j of size d = n*_j − n_j and set C'_j = B ∪ C_j with |C'_j| = n*_j; set also z'_j = z_{C'_j}. Then, by the first part of this proof, the claim follows. Proof. Denote z_j = z_{C_j} and r_j = z_j − z*_j. By Lemma 7.3 we have, for every j = 1, . . . , K, the stated bound, and thus the inequality follows.

Proof of Theorem 3.7
The proof consists of several steps, each represented by a separate lemma.
Lemma 7.6. Suppose Assumption 1 holds and let N ≥ 2. There is a constant C = C(γ, L) such that if

max(2, s log² T, n*) · log N / (T p²_min) ≤ 1/9, (7.1)

then, with probability at least 1 − 1/N and with ∆_1 = Cσ_max √(log N/(T p²_min)), the following inequalities take place for every j = 1, . . . , K. Proof. By Theorem 3.6, for any pair a, b ∈ R^N with ‖a‖ ≤ 1, ‖b‖ ≤ 1, the stated bound holds with probability ≥ 1 − N^{-m}. Suppose for a moment that m is large enough that we can neglect the second term. In order to meet the condition (3.6), we also need the corresponding bounds for every j = 1, . . . , K. We have |A_0| ≤ N², |B_0| ≤ NK and |A_j| ≤ sN, |B_j| ≤ sK for j = 1, . . . , N, so since s, K ≤ N, together they contain no more than 4N³ pairs of vectors (a, b), each having norm bounded by one. In addition, each (a, b) ∈ A_j has ‖a‖_0 ≤ s and ‖b‖_0 = 1, whereas each (a, b) ∈ B_j has ‖a‖_0 ≤ s and ‖b‖_0 ≤ n*. In the worst case, we need

{max(2, s, n*, √(n*s) log T) log(4N) + m log N} / (T p²_min) ≤ 1.
Taking a union bound, we have that the inequalities (7.2) and (7.3) hold with probability at least 1 − 4N^{3−m}. By analogy, we can show that (7.4) and (7.5) hold with probability at least 1 − 4N^{3−m}.
As for the last inequality, for every j = 1, . . . , K pick P_j = Σ_{i∈Λ_j} e_i e_i⊤, i.e., the projector onto the subspace of vectors supported on Λ_j. Then, by Theorem 3.5, the bound holds with probability at least 1 − KN^{-m} for every j = 1, . . . , K (taking into account (7.7)). The sparsity condition is satisfied once {max(2, s log T) log(4N) + u} / (T p²_min) ≤ 1.
The total probability will be at least 1 − 8N^{3−m} − KN^{-m}, which is at least 1 − 1/N whenever m ≥ 7 and N ≥ 2; both sparsity conditions are satisfied for m = 7.
In what follows we use additional notation. For a vector v ∈ R^N, let sign(v) ∈ {−1, 0, 1}^N denote the vector with coordinates −1 for negative, 1 for positive, and 0 for zero coordinates of v. We write s_j = sign(v*_j) for each j = 1, . . . , K; in addition, s*_j = (s_j)_{Λ_j}, which consists only of the values ±1, since Λ_j is the support of v*_j. In the following, we apply the technique of Gribonval et al. (2015). Suppose that the LASSO solution v̂_j for a given clustering C is not only supported exactly on Λ_j, but its signs also match those of the true v*_j. Then ‖v̂_j‖_1 = s*_j⊤ (v̂_j)_{Λ_j}. Therefore, we can write the solution explicitly, and plugging it into the risk function we get that F_λ(C) = Φ_λ(C), where the latter is defined explicitly. The next lemma shows that this representation holds in a local vicinity of the true clustering C*.
Consider the function defined above. The following lemma shows how this function grows as C retreats from the true clustering C*.
Lemma 7.8. Suppose C is a clustering such that r = |||Z_C − Z*|||_F ≤ 0.3. Then the stated bound holds. Let us first deal with the term Φ̄_0(C), where the minimum is taken subject to the restrictions supp(v_j) ⊂ Λ_j. Dropping the restrictions, we get Φ̄. It is not hard to calculate that the minimum is attained at V = [Θ*]⊤ Z_C, and therefore the claim follows using Θ* = Z*[V*]⊤ and the fact that λ_min([V*]⊤ Σ V*) ≥ σ_0. Moreover, we used the fact that Tr(P_C) = Tr(P_{C*}) = K. It remains to recall the result of Lemma 7.4. As for the linear term, it holds that, since v*_k is supported on Λ_k of size at most s, the corresponding bound applies.

Summing up, we get
The lemma now follows from the two terms put together.
The next step is to bound the difference Φ_λ(C) − Φ̄_λ(C) uniformly in the neighbourhood of C*. Lemma 7.9. Suppose that the inequalities (7.2)-(7.6) hold. First of all, due to (7.6), the first bound holds. Then, by the Cauchy-Schwarz inequality, and going further, notice first that, due to Lemma 7.2 and (7.2), the intermediate bound follows. Moreover, using (7.3), we also have |||Σ_{Λ_j,Λ_j}^{-1}|||_op ≤ 2σ_min^{-1} due to the condition σ_min^{-1} √s ∆_1 ≤ 1/2. Thus we get that the first sum in (7.10) is bounded accordingly, while the second sum is bounded using the fact that max_j ‖v*_j‖ ≤ |||V*|||_op = |||Θ*|||_op < 1 together with the condition of the lemma ∆_1 ≤ σ_max. Combining all the bounds, and since r ≤ 0.3 and √(sn*) ∆_1 ≤ σ_max allow us to neglect the third and fourth powers, respectively, the required bound follows.
Lemma 7.10. There are numerical constants c, C > 0 such that the following holds.
Suppose the following inequalities take place. Let Cσ_max √(log N/(T p²_min)) ≤ λ ≤ cσ_min τ_0 s^{-1}, and define r̄ accordingly. Then, under the inequalities (7.2)-(7.6), the following holds for the clustering Ĉ. Proof. It is not hard to see that for ∆_1 = √(log N/(T p²_min)) the inequalities required by Lemmata 7.7-7.9 are satisfied for r ≤ r̄, due to (7.11) and the conditions on λ and r̄. Since r̄ ≤ 0.2√α implies 10α^{-1} r̄² ≤ 1/3, it holds by (7.11) that, after dividing by r, such an optimal clustering must satisfy the stated bound. Recalling that |||V*|||_F ≤ √K, ∆_1 = Cσ_max √(log N/(T p²_min)), and ∆_2 = C √(s log N/(T p²_min)) yields the result. Now we are ready to finalize the proof of Theorem 3.7. First, we need to show that the clustering Ĉ from the lemma above is locally optimal. By Lemma 7.5, any clustering C neighbouring it satisfies |||Z_C − Z_Ĉ|||_F ≤ 2/√(αN/K). Therefore, it is enough to check that this value is at most r̄; we check that each of the two terms is at most r̄/2. For the first one, both requirements are satisfied due to the upper bound λ ≤ cκ^{-4}(a_0²/σ_max)K^{-2}s^{-1} and the requirement s n* log N/(T p²_min) ≤ c. For the second term, both requirements are satisfied once N ≥ Cα²K and λ ≥ Cσ_max α^{-3/2} K/N. Moreover, by Lemma 7.7 we have, for Θ̂ = Z_Ĉ [V̂_{Ĉ,λ}]⊤, the stated error bound, which finishes the proof.

A Proof of Theorems 3.5 and 3.6

Recall that we have a time series where W_t ∈ R^N, t ∈ Z, are independent vectors with EW_t = 0 and Var(W_t) = S. We also have |||Θ|||_op ≤ γ for some γ < 1, and the covariance Σ = Var(Y_t) reads as before. We have the observations where δ_it ∼ Bernoulli(p_i) are independent random variables for every i = 1, . . . , N and t = 1, . . . , T, with some p_i ∈ (0, 1]. The proofs of both statements are based on the following version of the matrix Bernstein inequality, which does not require bounded summands.
Recall that for a random variable X ∈ R the corresponding quantity denotes a ψ_j-norm. For j = 1 the norm is referred to as subexponential and for j = 2 as sub-Gaussian; see Definition 3.3. Theorem A.1 (Klochkov and Zhivotovskiy (2020), Proposition 4.1). Suppose the matrices A_t for t = 1, . . . , T are independent, and let M = max_t ‖ |||A_t|||_op ‖_{ψ_1} be finite. Then the stated bound holds, where σ² = ||| Σ_{t=1}^T E A_t A_t⊤ |||_op ∨ ||| Σ_{t=1}^T E A_t⊤ A_t |||_op and C is an absolute constant. Both Lounici (2014) and Klochkov and Zhivotovskiy (2020) assume that the probabilities of the observations are given. Using Chernoff's bound for the difference p̂_i − p_i and applying the union bound, we derive that with probability at least 1 − (1/2)e^{-u} the stated inequality holds. Notice that the right-hand side of the above display is dominated by the error that appears in Theorems 3.5 and 3.6. Consider the auxiliary estimators Σ̃ and Ã. Then we have, for Î = diag{p̂}^{-1} diag{p}, that Σ̂ = Î Diag(Σ̃) + Î Off(Σ̃) Î and Â = Î Ã Î.
Proposition A.2. Under the conditions of Theorems 3.5 and 3.6, for any two projectors $P, Q$ with ranks $M_1, M_2$, respectively, we have for any $u > 0$, with probability at least $1 - e^{-u}$, the bound (A.5). Moreover, with probability at least $1 - e^{-u}$, the bounds (A.6) and (A.7) hold. Here, $C = C(\gamma, L)$ depends only on $\gamma$ and $L$.
We first derive Theorems 3.5 and 3.6 from the above proposition; the rest of the section is devoted to its proof.
Proof of Theorems 3.5 and 3.6. Observe the decomposition above. The last term of the right-hand side is controlled by (A.5). As for the first one, let $\Lambda$ be the support of $P$ in accordance with Definition 3.4, and set $\Pi_{\Lambda} = \sum_{i \in \Lambda} e_i e_i^{\top}$, so that $P = P\Pi_{\Lambda}$ and $\mathrm{Rank}(\Pi_{\Lambda}) = K_1$. Moreover, $\Pi_{\Lambda}$ is diagonal; therefore, $\Pi_{\Lambda}\hat{I} = \hat{I}\Pi_{\Lambda}$. Thus, the stated identity holds. By (A.5) and (3.6) we have, with probability at least $1 - \frac{1}{8}e^{-u}$, the corresponding bound. Let $\Lambda'$ be the sparsity pattern of the projector $Q$ and $\Pi_{\Lambda'}$ the corresponding diagonal projector. Then we apply (A.6) to $\Pi_{\Lambda}(\mathrm{Off}(\tilde{\Sigma}) - \mathrm{Off}(\Sigma))\Pi_{\Lambda'}$, so that, provided with (3.6), we have with probability at least $1 - \frac{1}{8}e^{-u}$ the analogous bound. The same control applies using that $\Pi_{\Lambda}, \Pi_{\Lambda'}$ commute with the diagonal matrices $\hat{I}$, $\hat{I} - I$, and given that $\delta \le 1$. Applying (A.5) to $P(\mathrm{Diag}(\tilde{\Sigma}) - \mathrm{Diag}(\Sigma))Q$ with probability $1 - \frac{1}{8}e^{-u}$ and (A.6) to $P(\mathrm{Off}(\tilde{\Sigma}) - \mathrm{Off}(\Sigma))Q$, and putting the diagonal and off-diagonal terms together, we get that the combined bound holds with probability at least $1 - \frac{4}{8}e^{-u}$. It remains to notice that the bound (A.4) for $\delta$ holds with probability at least $1 - \frac{1}{2}e^{-u}$, and the corresponding $\delta$ is dominated by the remaining error term. This concludes the proof of Theorem 3.5.
Theorem 3.6 can be proved by treating $P(\hat{A} - A)Q$ similarly to the off-diagonal case above.
We now turn to the proof of Proposition A.2. Let $\delta_t = (\delta_{t1}, \dots, \delta_{tN})^{\top}$ denote the vector of the Bernoulli variables from above corresponding to the time point $t$. In what follows we consider the matrices $S^{k,j}$ and $A^{j,k}_{t,t'}$ (see (A.8)), for which the stated decomposition takes place and, similarly, its off-diagonal analogue. We first deal with the diagonal term. Let $P = \sum_{j=1}^{M_1} u_j u_j^{\top}$ be its eigendecomposition with $\|u_j\| = 1$; then each term in the resulting expansion is bounded by $\gamma^{2k}$ due to the fact that $\|\mathrm{diag}(u_j)\|_F = 1$. Summing up and taking the square root, we arrive at $\big\|\|P\,\mathrm{diag}(x_{\delta})\|_{op}\big\|_{\psi_2} \le \sqrt{C M_1}\,\gamma^k$. Taking into account the similar bound for $Q\,\mathrm{diag}(y_{\delta})$, the Hölder inequality yields the bound for the diagonal. As for the off-diagonal, consider first the whole matrix; since $\mathrm{Off}(A^{j,k}_{t,t'}) = A^{j,k}_{t,t'} - \mathrm{Diag}(A^{j,k}_{t,t'})$, the bound follows from the triangle inequality.
The following technical lemma will help us to upper-bound σ 2 in Theorem A.1.
Lemma A.4. Let $\delta_1, \dots, \delta_N$ consist of independent Bernoulli components with probabilities of success $p_1, \dots, p_N$, and set $p_{\min} = \min_{i \le N} p_i$. Let $a, b \in \mathbb{R}^N$ be two arbitrary vectors. Then the first stated inequality holds. Additionally, if $\delta_1', \dots, \delta_N'$ are independent copies of $\delta_1, \dots, \delta_N$, the second stated inequality holds.

Proof. The first inequality follows by direct computation. To show the second inequality, we use decoupling (Theorem 6.1.1 in Vershynin (2018)) and the trivial inequality $(x + y)^2 \le 2x^2 + 2y^2$. Denote for simplicity $\bar{\delta}_i = \delta_i - p_i$ and $\bar{\delta}_i' = \delta_i' - p_i$. Since the latter are centered, the expectation $\mathbb{E}\bar{\delta}_i\bar{\delta}_k$ is only non-vanishing when $i = k$, in which case it holds that $\mathbb{E}\bar{\delta}_i^2 = p_i - p_i^2$. Taking into account the similar property of $\mathbb{E}\bar{\delta}_j'\bar{\delta}_l'$, we have that the sum above equals the stated expression. It is left to note that $32(p_{\min}^{-1} - 1)^2 + 4 \le 32p_{\min}^{-2}$ for $p_{\min} \in (0, 1]$, which, recalling (A.11), completes the proof.
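The flavour of the first moment computation can be checked numerically: for the inverse-probability-weighted bilinear form $S = \sum_i a_i b_i \delta_i / p_i$ one has $\mathrm{Var}(S) = \sum_i a_i^2 b_i^2 (1 - p_i)/p_i$, since $\mathrm{Var}(\delta_i/p_i) = (1 - p_i)/p_i$. The sketch below (the quantity and names are ours, chosen to mirror the lemma's setting) verifies this by exact enumeration over all Bernoulli outcomes:

```python
import itertools
import numpy as np

def exact_variance(a, b, p):
    """Exact Var(sum_i a_i * b_i * delta_i / p_i) by enumerating all 2^N
    outcomes of the independent Bernoulli vector delta."""
    N = len(p)
    mean = sq = 0.0
    for bits in itertools.product([0, 1], repeat=N):
        prob = np.prod([pi if d else 1 - pi for d, pi in zip(bits, p)])
        s = sum(ai * bi * d / pi for ai, bi, d, pi in zip(a, b, bits, p))
        mean += prob * s
        sq += prob * s * s
    return sq - mean ** 2

a = np.array([1.0, -2.0, 0.5])
b = np.array([0.3, 1.0, -1.0])
p = np.array([0.5, 0.8, 0.4])
closed_form = np.sum(a**2 * b**2 * (1 - p) / p)   # Var(delta_i/p_i) = (1-p_i)/p_i
```

In particular, the closed form is at most $p_{\min}^{-1}\sum_i a_i^2 b_i^2$, which is the kind of $p_{\min}$-dependence appearing throughout the lemma.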
Similarly to (A.12), we can show the third inequality.
Now we apply the Bernstein matrix inequality to the sum $S^{k,j}$ defined in (A.8), dealing separately with the diagonal and off-diagonal parts. After that, we present the proofs of (A.5)-(A.7).
Lemma A.5. Under the assumptions of Theorem 3.5, the stated bound holds for any $u \ge 1$ with probability at least $1 - e^{-u}$.

Proof. Note the expression for the diagonal part. Let $\mathbb{E}_{\delta}$ denote the expectation with respect to the Bernoulli variables, conditioned on everything else. Setting $a = (x_1\gamma_1, \dots, x_N\gamma_N)^{\top}$ and $b = (y_1 u_1, \dots, y_N u_N)^{\top}$, we have the corresponding bound by the first inequality of Lemma A.4. Observe that, since $\mathrm{Tr}(\mathrm{diag}\{\gamma\}^2) = 1$ and due to (A.9), it holds by Theorem 2.1 of Hsu et al. (2012) that $\mathbb{E}^{1/2}\|a\|^4 \le \big\|\|a\|^2\big\|_{\psi_1} \le C\gamma^{2k}$. Similarly, it holds that $\mathbb{E}^{1/2}\|b\|^4 \le C\gamma^{2j}$, which together implies the required moment bound.

Now notice that $A_t$ is not necessarily an independent sequence, as $A_t$ depends directly on $(W_{t-k}, W_{t-j}, \delta_t)$, which might intersect with the variables of $A_{t'}$ for $t' = t + |j - k|$. However, if we take a set $I \subset [1, T]$ such that any two $t, t' \in I$ satisfy $|t - t'| \neq |j - k|$, then the sequence $(A_t)_{t \in I}$ is independent. We separate the whole interval $[1, T]$ into two such independent sets, $I_1$ and $I_2$, according to the scheme (A.13). Indeed, if $t, t' \in I_1$, then $\lfloor t/|j - k|\rfloor$ and $\lfloor t'/|j - k|\rfloor$ are either equal or differ by at least two, so that in the first case we have $|t - t'| < |j - k|$ and in the second $|t - t'| > |j - k|$. Since both sets have at most $T$ elements, the bound holds by Theorem A.1 with probability at least $1 - e^{-u}$ for each set; summing up the two and dividing by $T$, we get the result.
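The splitting scheme (A.13) used above can be written out explicitly. The helper below (names are ours) partitions the indices by the parity of $\lfloor t/h \rfloor$ with $h = |j - k|$, so that no two indices within the same subset are exactly $h$ apart:

```python
def split_by_gap(T, h):
    """Partition {1, ..., T} into I1, I2 such that no two indices inside the
    same subset differ by exactly h: put t in I1 if floor(t / h) is even,
    else in I2.  Within one block of length h all gaps are < h; across two
    blocks of the same parity all gaps are > h."""
    I1 = [t for t in range(1, T + 1) if (t // h) % 2 == 0]
    I2 = [t for t in range(1, T + 1) if (t // h) % 2 == 1]
    return I1, I2

I1, I2 = split_by_gap(T=30, h=4)
```

Within each subset the summands $A_t$ are then independent, so the matrix Bernstein inequality applies to each subset separately.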
Lemma A.6. Under the assumptions of Theorem 3.5, the stated bound holds for any $u \ge 1$ with probability at least $1 - e^{-u}$, where $C = C(K)$ depends only on $K$.
Proof. Using the notation $x = \Theta^k W_{t-k}$, $y = \Theta^j W_{t-j}$ and $a = \mathrm{diag}\{\gamma\}x$, $b = \mathrm{diag}\{u\}y$, we have $\mathrm{Off}(A^{j,k}_{t,t'}) = \mathrm{Off}(xy^{\top})$. Therefore, Lemma A.4 applies. From the proof of Lemma A.5 we know that $\mathbb{E}\|a\|^2\|b\|^2 \le C\gamma^{2k+2j}$. Moreover, we have $\sum_i a_i = \gamma^{\top}x$ and $\sum_i b_i = u^{\top}y$. Thus, by (A.10) it holds that $\mathbb{E}^{1/4}(\gamma^{\top}x)^4 \le \|\gamma^{\top}x\|_{\psi_2} \le C\gamma^j$ and, similarly, $\mathbb{E}^{1/4}(u^{\top}y)^4 \le C\gamma^k$. Putting those bounds together and applying the Cauchy-Schwarz inequality, we obtain the required bound and, by analogy, its counterpart. Applying the same sample splitting (A.13), we obtain the bound which, divided by $T$, provides the result.
Proof of (A.5). Setting $D^{k,j} = \mathrm{diag}\{p\}^{-1}\mathrm{Diag}(S^{k,j})$, by Lemma A.5, for any $u \ge 1$, the corresponding bound holds with probability at least $1 - e^{-u}$. Take a union of those bounds for every $k, j$ with $u_{k,j} = k + j + 1 + u$ for arbitrary $u \ge 0$; the total probability of the complementary event is at most the stated value. By definition, $\mathrm{Diag}(\tilde{\Sigma}) = \mathrm{diag}\{p\}^{-1}\sum_{k,j \ge 0}\mathrm{Diag}(S^{k,j})$. Due to (A.8) and since $\mathbb{E}\tilde{\Sigma} = \Sigma$, the claimed bound holds on such an event, which completes the proof after summing the resulting geometric series over $k, j \ge 0$.

Proof of (A.6). This works similarly to the above, applying Lemma A.6 to $D^{k,j} = \mathrm{diag}\{p\}^{-1}\mathrm{Off}(S^{k,j})\mathrm{diag}\{p\}^{-1}$ and using the fact that $\mathrm{Off}(\tilde{\Sigma}) = \sum_{j,k \ge 0} D^{j,k}$ by definition.
Proof of (A.7). Recall the definition of $\tilde{A}$ and of the matrices $A^{k,j}_{t,t+1}$, for which the stated decomposition takes place. We first apply the Bernstein matrix inequality to each $S^{k,j}$ separately. Observe the expression for $B_t$; by Lemma A.3 each term satisfies $\max_t \big\|\|B_t\|_{op}\big\|_{\psi_1} \le C\sqrt{M_1 M_2}\,\gamma^{k+j}$.
Therefore, we obtain the first variance bound, and similar derivations yield the second. Now we separate the indices $t = 1, \dots, T$ into four subsets, such that each corresponds to a set of independent matrices $B_t$. Since each $B_t$ is generated by $W_{t-k}$, $W_{t+1-j}$, $\delta_t$, and $\delta_{t+1}$, we need to ensure that no pair of indices $t, t'$ from the same subset satisfies $|t - t'| = |k - j + 1|$ or $|t - t'| = 1$. This can be achieved by the following partition. First, we split the indices into two subsets with odd and even indices, respectively, so that neither subset contains two indices with $|t - t'| = 1$. Then each of the two subsets is further split in two according to the scheme (A.13), so that $|t - t'| = |k - j + 1|$ is avoided within each subset. Therefore, applying the Bernstein inequality, Theorem A.1, to each sum separately and summing them up, we get that for any $u \ge 1$, with probability at least $1 - e^{-u}$, the stated bound holds. Similarly to the proof of (A.5), we take the union of those bounds for every $k, j$ with $u_{k,j} = j + k + u$, and the result follows.

B LASSO and missing observations
Suppose we observe a signal $y \in \mathbb{R}^n$ of the form $y = \Phi b^* + \varepsilon$, where $\Phi = [\phi_1, \dots, \phi_p] \in \mathbb{R}^{n \times p}$ is a dictionary of words $\phi_j \in \mathbb{R}^n$ and $b^*$ is a sparse parameter with support $\Lambda \subset \{1, \dots, p\}$. We want to recover the exact sparse representation by solving the quadratic program $$\hat{b} \in \arg\min_{b \in \mathbb{R}^p} \tfrac{1}{2}\|y - \Phi b\|_2^2 + \gamma\|b\|_1. \qquad (B.1)$$ Denote by $\mathbb{R}^{\Lambda}$ the set of vectors with elements indexed by $\Lambda$; for $x \in \mathbb{R}^p$, let $x_{\Lambda} \in \mathbb{R}^{\Lambda}$ be the result of taking only the elements indexed by $\Lambda$. With some abuse of notation, we associate every vector $x_{\Lambda} \in \mathbb{R}^{\Lambda}$ with the vector in $\mathbb{R}^p$ that has the same coefficients on $\Lambda$ and zeros elsewhere. Let $\Phi_{\Lambda} = [\phi_j]_{j \in \Lambda}$ be the subdictionary composed of the words indexed by $\Lambda$, and let $P_{\Lambda}$ be the projector onto the corresponding subspace.
The following sufficient conditions for the global minimizer of (B.1) to be supported on $\Lambda$ are due to Tropp (2006), who uses the notion of the exact recovery coefficient, $$\mathrm{ERC}(\Lambda) = 1 - \max_{j \notin \Lambda}\|\Phi_{\Lambda}^{+}\phi_j\|_1,$$ where $\Phi_{\Lambda}^{+}$ denotes the pseudoinverse of $\Phi_{\Lambda}$. The results are summarized in the next theorem.
Theorem B.1 (Tropp (2006)). Let $\hat{b}$ be a solution to (B.1). Suppose that $\|\Phi^{\top}\varepsilon\|_{\infty} \le \gamma\,\mathrm{ERC}(\Lambda)$. Then:
• the support of $\hat{b}$ is contained in $\Lambda$;
• the distance between $\hat{b}$ and the optimal (non-penalized) parameter satisfies the stated bound.
In what follows, we extend this result to the setting of missing observations. Observe that the program (B.1) is equivalent to $$\min_{b \in \mathbb{R}^p} \tfrac{1}{2}b^{\top}Db - c^{\top}b + \gamma\|b\|_1,$$ so that the minimization procedure only depends on $D = \Phi^{\top}\Phi$ and $c = \Phi^{\top}y$. Suppose that instead we only have access to some estimators $\hat{D} \succeq 0$ and $\hat{c}$ that are close enough to the original matrix and vector, respectively, which may come, e.g., from a missing-observations model. Then we can solve instead the following problem, $$\min_{b \in \mathbb{R}^p} \tfrac{1}{2}b^{\top}\hat{D}b - \hat{c}^{\top}b + \gamma\|b\|_1. \qquad (B.2)$$ In what follows, we provide a slight extension of Tropp's result towards missing observations; the proof mainly follows the same steps. Below, for a matrix $D$ and two sets of indices $A, B$, we denote the corresponding submatrix by $D_{A,B}$, and for a vector $c$, the corresponding subvector by $c_A$.
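Since the program depends on the data only through $(\hat{D}, \hat{c})$, it can be solved without access to $\Phi$ or $y$ themselves. The sketch below uses plain proximal-gradient (ISTA) iterations, which is our choice of illustrative solver rather than anything prescribed in the text:

```python
import numpy as np

def soft_threshold(v, tau):
    """Proximal operator of tau * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def lasso_from_moments(D_hat, c_hat, gamma, n_iter=5000):
    """Minimize 0.5 * b'D_hat b - c_hat'b + gamma * ||b||_1 given only the
    moment estimators (D_hat, c_hat), via ISTA with step 1 / lambda_max(D_hat)."""
    step = 1.0 / np.linalg.eigvalsh(D_hat)[-1]
    b = np.zeros_like(c_hat)
    for _ in range(n_iter):
        b = soft_threshold(b - step * (D_hat @ b - c_hat), step * gamma)
    return b

# toy instance using the exact moments D = Phi'Phi, c = Phi'y for a sparse b*
rng = np.random.default_rng(2)
Phi = rng.standard_normal((100, 10))
b_star = np.zeros(10)
b_star[:2] = [3.0, -2.0]
y = Phi @ b_star + 0.01 * rng.standard_normal(100)
b_hat = lasso_from_moments(Phi.T @ Phi, Phi.T @ y, gamma=0.5)
```

The returned `b_hat` satisfies the usual subgradient optimality conditions of the $(\hat{D}, \hat{c})$ objective, which is exactly the structure exploited in the lemma below.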
Suppose the condition stated above holds. Then the solution $\hat{b}$ to (B.2) is supported on $\Lambda$.
Proof. Let $\hat{b}$ be the solution to (B.2) under the restriction $\mathrm{supp}(b) \subset \Lambda$. Since $\hat{D} \succeq 0$, this is a convex problem, and therefore the solution is unique and satisfies the first-order condition $\hat{D}_{\Lambda,\Lambda}\hat{b}_{\Lambda} - \hat{c}_{\Lambda} + \gamma g = 0$ with $g \in \partial\|\hat{b}_{\Lambda}\|_1$, where $\partial f(b)$ denotes the subdifferential of a convex function $f$ at a point $b$; in the case of the $\ell_1$ norm we have $\|g\|_{\infty} \le 1$. Thus, $$\hat{b}_{\Lambda} = \hat{D}_{\Lambda,\Lambda}^{-1}\hat{c}_{\Lambda} - \gamma\hat{D}_{\Lambda,\Lambda}^{-1}g. \qquad (B.3)$$ Next, we want to check that $\hat{b}$ is a global minimizer. To do so, let us compare the objective function at a point $b = \hat{b} + \delta e_j$ for an arbitrary index $j \notin \Lambda$. Since $\|b\|_1 = \|\hat{b}\|_1 + |\delta|$, the increment of the objective is bounded from below as stated, where the latter comes from the fact that $\hat{D}$ is positive semidefinite. Applying the equality (B.3) yields $$e_j^{\top}\hat{D}\hat{b} = \hat{D}_{j,\Lambda}\hat{D}_{\Lambda,\Lambda}^{-1}\hat{c}_{\Lambda} - \gamma\hat{D}_{j,\Lambda}\hat{D}_{\Lambda,\Lambda}^{-1}g,$$ so that the resulting right-hand side of the increment bound is nonnegative by the condition of the lemma. Since $j \notin \Lambda$ is arbitrary, $\hat{b}$ is a global solution as well.
Remark B.1. It is not hard to see that in the exact case $\hat{D} = \Phi^{\top}\Phi$ and $\hat{c} = \Phi^{\top}y$, the condition of the lemma above turns into the condition $\|\Phi_{\Lambda^c}^{\top}P_{\Lambda}\varepsilon\|_{\infty} \le \gamma\,\mathrm{ERC}(\Lambda)$ of Theorem B.1.
Since we are particularly interested in applications to time series, the feature matrix $\Phi$ is in fact random, so imposing an ERC-like condition on it directly might result in additional unnecessary technical difficulties. Instead, let us assume that there is some other matrix $\tilde{D}$, potentially the expectation of $\Phi^{\top}\Phi$, that is close enough to $\hat{D}$ (with some probability, but we state all the results in this section deterministically), and the value that controls exact recovery becomes $$\mathrm{ERC}(\Lambda; \tilde{D}) = 1 - \|\tilde{D}_{\Lambda^c,\Lambda}\tilde{D}_{\Lambda,\Lambda}^{-1}\|_{1,\infty}.$$
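For concreteness, $\mathrm{ERC}(\Lambda; \tilde{D})$ can be computed directly from a moment matrix. The sketch below reads the $(1,\infty)$-norm as the maximum $\ell_1$ norm over rows, which is our interpretation of the notation:

```python
import numpy as np

def erc(D, support):
    """ERC(Lambda; D) = 1 - ||D_{Lambda^c, Lambda} D_{Lambda, Lambda}^{-1}||_{1, inf},
    with the (1, inf)-norm taken as the maximum l1 norm over rows."""
    idx = np.asarray(support)
    comp = np.setdiff1d(np.arange(D.shape[0]), idx)
    M = D[np.ix_(comp, idx)] @ np.linalg.inv(D[np.ix_(idx, idx)])
    return 1.0 - np.abs(M).sum(axis=1).max()

# for an orthonormal dictionary (D = I) the off-support block vanishes, so ERC = 1,
# the most favourable case for exact recovery
D = np.eye(6)
val = erc(D, support=[0, 2])
```

Correlated dictionaries shrink the coefficient towards (and past) zero, which is why a positive ERC acts as the sufficient condition for support recovery above.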