Crawling subsampling for multivariate spatial autoregression model in large-scale networks

Abstract: In network data analysis, multivariate spatial autoregression (MSAR) models may be used to analyze the autocorrelation among multiple responses. With large-scale networks, the estimation of MSAR on the entire network is computationally expensive. In this case, the subsampling method can be adopted. This approach selects a sample of nodes and then uses the estimate based on the sample to approximate the estimate on the full data. However, traditional sampling methods cannot obtain unbiased parameter estimates. Considering the second-order friend information of the sampled nodes, we propose the crawling subsampling (CS) method in a general framework. Thereafter, based on the sampled data only, we construct the least-squares objective function. Under certain conditions, the computational complexity of optimizing the objective function is linear in the sample size n_s. The identification condition for the parameters on the sampled network is theoretically provided. The sample size order requirement is provided, and the asymptotic properties of the least-squares estimators are investigated. Numerical results for simulated and real data are presented to demonstrate the performance of the proposed CS method and least-squares estimator for the MSAR model.


Introduction
In recent years, there has been a surge of interest in network data analysis [42,22,23,5]. In network data, the behaviors of individuals are influenced by those with whom they have relationships [7]. Thus, the responses of different individuals are dependent on the network structure [33]. The spatial autoregression (SAR) model has become popular in network analysis to gauge the influence of the network structure on individual responses [1]. To analyze the autocorrelation among multiple responses, the SAR model has recently been extended to multivariate cases [43,11], resulting in the multivariate spatial autoregression (MSAR) model. We take Twitter as an example. The tweets on different topics (for example, investment and savings) naturally constitute multivariate responses for each user. Consider the entire network with N nodes. Quasi-maximum likelihood estimation has been proposed to estimate the MSAR model. However, its computational complexity is O(N^3) [3,2]. Thus, Zhu et al. [46] proposed a novel least-squares estimation (LSE) for MSAR, by means of which the computational complexity can be sharply reduced.

W. Hu et al.
However, in large-scale networks, estimation on the entire network data may be infeasible. As a result, the estimation of network models is typically conducted based on samples [30]. We consider samples collected from the entire network by some well-designed sampling schemes. Obtaining a relatively accurate estimate based on a subsample then becomes an important issue [6].
Subsampling is a classical method proposed by [28]. This approach selects the sample nodes and related information, and subsequently uses the estimate based on the sampled data. This method has been re-popularized in large-scale data analysis [24,12,44] for approximating the estimate on the full data. When dealing with independent data, numerous efficient subsampling methods are available for linear regression [25,39] and logistic regression [40,37]. For dependent data, such as time-series or network data, the subsampling method is popular but challenging [17,4,9]. This is because the sampled network inevitably loses the connections between certain individuals [30]. As a result, the consistency of the estimator on the sampled data hardly holds [30]. Moreover, sampling methods are typically designed for a particular model [34,18,30,4].
In particular, for SAR-related models, the estimated autocorrelation parameter tends to be negatively biased on the sampled data [7]. As a result, appropriate network sampling schemes and estimation methods should be discussed. Zhou et al. [45] proposed the paired maximum likelihood estimator method for sampled data to obtain consistent estimation for the SAR model. However, as a Taylor expansion is conducted, this method requires the autocorrelation coefficient to approach zero, which can hardly be satisfied in all cases. Furthermore, Huang et al. [20] proposed the LSE estimator for the classical univariate SAR model with no covariates and discussed a simple sampled case. However, the required sample size and corresponding proof were not discussed in [20]. This served as the inspiration for the current work. It is worth noting that the MSAR model is an extension of the classical SAR model [46]. The manner in which to conduct consistent estimation for MSAR models based on sampled data thus becomes a problem of interest.
In this study, focusing on the MSAR model in large-scale networks, we propose the crawling subsampling (CS) method in a general framework. The proposed approach is an extension of the sampling method in [20]. First, we design a general three-step CS scheme. The sample collected can retain the necessary network structure information for estimation. Based on the selected sample, we discuss the parameter identification issue and use LSE to obtain the estimates of the parameters. The computational complexity of optimizing the objective function is discussed. Thereafter, we provide the sample size order requirement. Finally, we establish the asymptotic properties of the estimator under certain technical conditions. The performance of the proposed method is demonstrated by various simulation examples and a data analysis of a third-party restaurant review website.
The remainder of this paper is organized as follows. Section 2 introduces the CS method and investigates the theoretical properties of the LSE based on the subsample. The simulation and real dataset studies are presented in Section 3. The paper is concluded in Section 4. All the technical details are presented in Appendix A and Appendix B.

Model and notations
For a large-scale network with N nodes, a_{i1i2} = 1 is defined if there exists a relationship from node i_1 to node i_2 (i_1 ≠ i_2), and a_{i1i2} = 0 otherwise, where 1 ≤ i_1, i_2 ≤ N. Thus, the adjacency matrix is recorded as A = (a_{i1i2}) ∈ R^{N×N}, and a_{i1i1} = 0 is assumed.
Furthermore, the row-normalized adjacency matrix is defined as W = (w_{i1i2}) ∈ R^{N×N} with w_{i1i2} = a_{i1i2}/d_{i1}, where d_{i1} = Σ_{j=1}^N a_{i1j} is the nodal out-degree of node i_1. For the nodal features, assume that each node i has p-dimensional continuous responses (Y_i ∈ R^p) and q-dimensional exogenous covariates (X_i ∈ R^q), where 1 ≤ i ≤ N. Herein, we define n_s as a small sample size, and n_x as the number of unsampled nodes followed by the sampled nodes. Next, we define Y_g = (Y_ij) ∈ R^{n_s×p} as the response matrix of the sampled nodes and Y_{g0} ∈ R^{(n_s+n_x)×p} as the response matrix of the sampled nodes and their following nodes (i.e., followees). We define X_g = (X_ik) ∈ R^{n_s×q} as the exogenous covariate matrix of the sampled nodes and E_g = (ε_ij) ∈ R^{n_s×p} as the noise matrix of the sampled nodes, wherein we assume that the ε_i are independently and identically distributed with mean 0_p ∈ R^p and cov(ε_i) = Σ_e ∈ R^{p×p}. Then, we define the MSAR model on a sampled network. We denote the parameters α = (α_{j'j}) ∈ R^{p×p} and B = (b_{kj}) ∈ R^{q×p}. Thus, the MSAR model on the sampled network can be expressed as

Y_g = W_{g0} Y_{g0} α^⊤ + X_g B + E_g,   (2.1)

where W_{g0} ∈ R^{n_s×(n_s+n_x)} in (2.1) only retains the rows of the sampled nodes in W; i.e., W_{g0} records the edges between the sampled nodes and their followees. Following [11] and [46], α_jj is referred to as the intra-activity effect or endogenous effect, which measures the impact from the same type of response variable of the connected neighbors; α_{j'j} (j' ≠ j) is known as the extra-activity effect or cross-activity peer effect, which measures the impact from the other response types; and b_kj (1 ≤ k ≤ q) is referred to as the own effect, which reflects the impact from the exogenous covariates. A detailed presentation of the MSAR model and related notations can also be found in [46]. The impact relationships among the responses and exogenous covariates in (2.1) can be interpreted as illustrated in Figure 1. Subsequently, we discuss the parameter space. Let |λ_max(α)| be the maximum absolute eigenvalue of the matrix α. Assuming that |λ_max(α)| < 1, the matrix (I_{Np} − α ⊗ W) is invertible (as proven by Lemma 1 of [46]), where I_{Np} is the (Np × Np)-dimensional identity matrix and ⊗ is the Kronecker product. In this study, based on the sampled data, we focus on the estimation of α ∈ R^{p×p} and β = vec(B) ∈ R^{pq} in the parameter space {|λ_max(α)| < 1}.
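To make the model concrete, the following minimal numpy sketch simulates data from the MSAR mechanism via the vectorized representation vec(Y) = (α ⊗ W)vec(Y) + vec(XB) + vec(E) used later in Appendix A; the random network, sizes, and parameter values here are illustrative assumptions, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(0)
N, p, q = 200, 2, 2

# Illustrative random directed adjacency matrix with zero diagonal.
A = (rng.random((N, N)) < 0.02).astype(float)
np.fill_diagonal(A, 0)

# Row-normalized weighting matrix W: w_{i1i2} = a_{i1i2} / d_{i1}.
d = A.sum(axis=1)
W = A / np.maximum(d, 1)[:, None]

# Parameters with |lambda_max(alpha)| < 1, so the model is well defined.
alpha = np.array([[0.3, 0.0], [-0.2, 0.1]])
B = np.array([[-0.5, 1.0], [1.3, 0.3]])   # q x p; vec(B) = (-0.5, 1.3, 1, 0.3)
X = rng.normal(size=(N, q))
E = rng.normal(size=(N, p))

# Solve vec(Y) = (alpha kron W) vec(Y) + vec(X B + E)  (column-major vec).
S = np.eye(N * p) - np.kron(alpha, W)
vecY = np.linalg.solve(S, (X @ B + E).flatten(order="F"))
Y = vecY.reshape((N, p), order="F")
```

By construction, the matrix form Y = W Y α^⊤ + X B + E holds exactly for the simulated Y.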

Crawling subsampling scheme
In this section, we discuss the detailed procedures of the CS scheme. In the model (2.1), only the nodes followed by the sampled nodes are considered (i.e., only part of the first-order friends). However, to calculate the least-squares-type objective function defined below (refer to (2.2)-(2.4) for further details), the estimation method involves the first- and second-order friends of the sampled nodes. Thus, the proposed CS scheme aims to retain the network structure and the corresponding nodal features necessary for estimation.
Next, we focus on the sampling scheme. When using CS, the sampled data contain the nodal features and double-layer supplementary information of the sampled nodes. We define g_i as a sampling indicator, with g_i = 1 if node i is sampled and g_i = 0 otherwise. We define G_y = {i : g_i = 1, 1 ≤ i ≤ N} as the collection of sampled nodes. The sampling scheme can be implemented in the following three steps. These steps are illustrated in Figure 2, and the related notations are presented below.
STEP 1: Obtain the original node set G_y via a convenient sampling method, such as simple random sampling. Denote the nodes in G_y as sampled nodes.
STEP 2: Collect the followers and followees of the sampled nodes, which are defined as G_x1 and G_x2, respectively. In other words, collect all nodes j that are outside G_y but are directly connected to some node in G_y, together with the corresponding nodal features. The two node sets can be expressed as G_x1 = {j ∉ G_y : a_{ji} = 1 for some i ∈ G_y} and G_x2 = {j ∉ G_y : a_{ij} = 1 for some i ∈ G_y}.
STEP 3: Collect the second-order node set G_x3, i.e., the nodes outside G_y ∪ G_x1 ∪ G_x2 that share a common follower with some sampled node, together with the corresponding nodal features. Denote G_x = G_x1 ∪ G_x2 ∪ G_x3 as the supplementary node set, which records the supplementary nodes outside the sampled nodes. Finally, based on the sampled data, we can construct the least-squares objective function in the next section.
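The three steps above can be sketched as follows; `crawling_subsample` is a hypothetical helper name, and the STEP 3 logic (collecting nodes that share a common follower with a sampled node) reflects our reading of the scheme, not the authors' code.

```python
import numpy as np

def crawling_subsample(A, n_s, rng):
    """Sketch of the three-step CS scheme on an adjacency matrix A."""
    N = A.shape[0]
    # STEP 1: simple random sample of n_s seed nodes.
    G_y = np.zeros(N, dtype=bool)
    G_y[rng.choice(N, size=n_s, replace=False)] = True

    # STEP 2: first-order supplementary nodes outside G_y.
    G_x1 = A[:, G_y].any(axis=1) & ~G_y   # followers of sampled nodes
    G_x2 = A[G_y, :].any(axis=0) & ~G_y   # followees of sampled nodes

    # STEP 3: second-order nodes sharing a common follower with a sampled node.
    pool = G_y | G_x1 | G_x2
    follows_sampled = A[pool][:, G_y].any(axis=1)      # pool nodes following some sampled node
    co_followed = A[pool][follows_sampled].any(axis=0)  # their other followees
    G_x3 = co_followed & ~pool
    return G_y, G_x1, G_x2, G_x3
```

The returned boolean masks are pairwise disjoint, so G_y and the supplementary set G_x = G_x1 ∪ G_x2 ∪ G_x3 never overlap.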

LSE based on subsample
In this section, based on the sampled data, we construct the least-squares objective function. Moreover, the computational complexity of optimizing the objective function and the identification issue for the parameters are discussed.
First, we define the best linear predictor (BLP) of the response Y_{i1j1} based only on the sampled data collected above, for any sampled node i_1 ∈ G_y and 1 ≤ j_1 ≤ p. Based on the sampled data, which consist of the nodes in G_y ∪ G_x, the necessary connections, and the corresponding nodal features, we construct the BLP of Y_{i1j1} given Y_{−(i1,j1)} [31]. The BLP, defined as F(Y_{i1j1} | Y_{−(i1,j1)}), can be obtained as in (2.2)-(2.3), where Ω_e = Σ_e^{-1}; Ω_{e,•j2} and Ω_{e,j2•} represent the j_2th column and j_2th row of the matrix Ω_e, respectively; and α_{j1•} and α_{•j2} represent the j_1th row and j_2th column of the matrix α, respectively. The verification of (2.2) is presented in Appendix A.1. Three types of unsampled nodes are involved when r_{i1j1i2j2} ≠ 0 in (2.3), as illustrated in Figure 3, where i_1 represents the sampled nodes (G_y) and i_2 represents the supplementary nodes (G_x): (1) nodes i_1 followed by i_2 (w_{i2i1} ≠ 0), representing i_2 ∈ G_x1 (left panel); (2) nodes i_2 followed by i_1 (w_{i1i2} ≠ 0), representing i_2 ∈ G_x2 (middle panel); and (3) i_1 and i_2 both followed by a third node i (Σ_{i∈G_y∪G_x} w_{ii1} w_{ii2} ≠ 0), representing i_2 ∈ G_x3 (right panel). As a result, the three types of unsampled nodes involved in (2.3) correspond exactly to the supplementary node sets G_x1, G_x2, G_x3 collected by STEP 2 and STEP 3 in Figure 2. It should be noted that the calculation of any μ_{i1j1} in (2.2) still involves the nodal exogenous covariates of the entire network. However, in the proposed objective function (2.4), the mean term μ_{i1j1} is eliminated, so we can calculate and optimize the objective function based only on the sampled data.
Next, we discuss the objective function based only on the sampled data. For 1 ≤ i ≤ N and 1 ≤ j ≤ p, we define the corresponding fitted terms, the sampling indicator matrix g = diag{g_1, ..., g_N} ∈ R^{N×N}, and G = I_p ⊗ g. Then, for the sampled nodes, the least-squares objective function can be written as in (2.4). The verification of (2.4) is provided in Appendix A.1. It should be emphasized that the matrix G in (2.4) is presented simply for convenience of the proof.
The proof of Theorem 1 is presented in Appendix B.1. According to Theorem 1, in the ideal case, the computational complexity of optimizing the objective function is linear in the sample size n_s. Even if the condition is not satisfied, the computational complexity is relatively low because the number of nodes involved is significantly smaller than that of the entire network. It is remarkable that the convergence of the Newton-Raphson algorithm is guaranteed by [14] and [29], which requires the objective function to be twice differentiable and its second-order derivative to be nonsingular. As a result, under condition (C1) and the assumption in Theorem 2, which we introduce later, the algorithm converges. Subsequently, based on the sampled data, we investigate the identification of the parameters. In this identification part only, the true parameters of α and β are denoted as α_0 and β_0, respectively. We define e_i to be a 1 × p row vector in which the ith entry is 1 and the others are 0. Thereafter, we make the following assumption.
Note that condition (C1) corroborates the identification condition in the existing literature [43,11]. The identification on the sampled data can be proven under condition (C1).
Theorem 2. Assume that there exists δ > 0 such that λ_min(S^⊤S) ≥ c_s, where c_s is a positive constant. Then, under condition (C1), in the parameter space {|λ_max(α)| ≤ 1 − δ}, the parameters α_0 and β_0 can be identified on the sampled network.
The proof of Theorem 2 is presented in Appendix B.2. According to Theorem 2, α_0 and β_0 can be identified on the sampled network.

Asymptotic property of LSE based on CS
In this section, for the least-squares estimators on the sampled data, we discuss the required sample size. Then, we investigate the asymptotic distribution under the identifiability assumption. First, recall that L is the least-squares objective function and that the sample size is n_s. To give the form of the asymptotic distribution, we denote α_{j1j2} as the element in the j_1th row and j_2th column of the matrix α. With respect to α_{j1j2} and the vector β, we denote the corresponding first- and second-order derivatives of L, together with the related matrices and vectors used below. To establish the asymptotic property based on the crawling subsample, we focus on the following technical conditions on large-scale network data.
(C2) (Network structure) (C2.1) (Connectivity of the entire network) Let the set of all nodes {1, ..., N} be the state space of a Markov chain, with the transition probability matrix given by W. The Markov chain is assumed to be irreducible and aperiodic. Furthermore, π = (π_i) ∈ R^N is defined as the stationary distribution vector of the Markov chain (i.e., π_i ≥ 0, Σ_{i=1}^N π_i = 1, and W^⊤π = π). Conditions (C2.1) and (C2.2) are two separate assumptions regarding the overall network structure. Condition (C2.1) is similar to those in other studies [20,46] and requires certain connectivity of the network structure. If the network is fully connected within a limited number of steps, the irreducibility and aperiodicity can be satisfied simultaneously. This condition is satisfied if the well-known theory of six-degree separation [26] holds. Moreover, δ represents the global influence; typically, for a large-scale network, there is no dominant node (i.e., δ → 1/2). Condition (C2.2) imposes a certain uniformity assumption only on the sampled network (i.e., gW and Wg). (C2.2) requires λ_max{g(W + W^⊤)g} to diverge slowly, at a rate of O(log n_s). Naturally, condition (C2.2) can be considered a generalization of the uniformity assumption on the entire network [20,46,21]. Next, we discuss the sample size required for the subsampling method.
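For a given row-stochastic W, the stationary distribution π appearing in (C2.1) can be approximated numerically by power iteration, assuming the chain is irreducible and aperiodic; this is a sketch for intuition, not part of the paper's procedure.

```python
import numpy as np

def stationary_distribution(W, n_iter=2000):
    """Power iteration for pi satisfying pi^T W = pi^T with sum(pi) = 1."""
    N = W.shape[0]
    pi = np.full(N, 1.0 / N)   # start from the uniform distribution
    for _ in range(n_iter):
        pi = pi @ W
        pi /= pi.sum()         # guard against numerical drift
    return pi
```

When there is no dominant node, the entries of π are all of comparable (small) magnitude, which is the intuition behind δ → 1/2.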
Proposition 1. Under conditions (C2) and (C4), there exists ε > 0 such that the sample size in G_y satisfies the stated order requirement. Thus, the sample size order is related to the nodes with global influence. The condition on the sample size n_s is necessary in Theorem 3. The required sample size n_s will be smaller for networks with smaller global influence. Thus, we state the condition as follows: (C3) (Order of sample size) Assume that there exists ε > 0 such that the sample size order requirement of Proposition 1 holds. (C6) Assume that the six limits in (2.6) exist. In the overall network data, conditions (C4) and (C5) set regularity conditions on the exogenous covariates and noise terms, respectively.
Theorem 3. Assume conditions (C1) to (C6). Rewrite the true parameters as θ = (vec(α)^⊤, β^⊤)^⊤ ∈ R^{p²+pq}, and let θ̂ denote the least-squares estimator. Then, in the parameter space {|λ_max(α)| < 1}, as n_s → ∞, √n_s(θ̂ − θ) converges to a normal limit whose covariance is given in (2.8); the detailed expression of (2.8) is provided in (2.6). The proof of Theorem 3 is presented in Appendix B.3. According to Theorem 3, θ̂ is √n_s-consistent and asymptotically normally distributed on the sampled data. The estimate is obtained by optimizing (2.4) using the Newton-Raphson algorithm.
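The Newton-Raphson optimization referred to above can be sketched generically as follows, assuming access to the gradient and Hessian of the objective (the function names are ours, not the paper's); it requires exactly the twice-differentiability and nonsingular second-order derivative discussed for Theorem 1.

```python
import numpy as np

def newton_raphson(grad, hess, theta0, tol=1e-10, max_iter=100):
    """Generic Newton-Raphson iteration: theta <- theta - H(theta)^{-1} g(theta)."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iter):
        step = np.linalg.solve(hess(theta), grad(theta))  # requires a nonsingular Hessian
        theta = theta - step
        if np.linalg.norm(step) < tol:
            break
    return theta
```

For a quadratic least-squares objective, a single Newton step already reaches the minimizer, which is why the per-iteration cost dominates the complexity analysis.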

Simulation settings
To demonstrate the performance of the parameter estimation on finite sampled data, we compare four sampling methods on three different networks. The first three classical sampling methods are simple random sampling (SRS), snowball sampling (SN), and Metropolis-Hastings random walk (MHRW). The final sampling method is CS. We first describe the three traditional sampling methods and then the generation mechanism of the simulated data. We define p_i as the sampling probability of node i.
(1) SRS: A simple random sample is drawn by selecting n_s nodes from the total N nodes with equal probability p_i = 1/N. Thus, SRS roughly ignores the network topology.
(2) SN: SN [16] is a traditional network sampling method that captures the k-order friends of an initial seed node. The SN sample can be obtained as follows: (i) randomly select a node i and collect all the nodes directly connected to i, denoting the sampled node set as G_y^t, where t represents the tth step of the sampling; (ii) collect the nodes connected to the nodes in G_y^t and add them to the sample; repeat until the sample size reaches n_s.
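A minimal sketch of SN, under the simplifying assumptions that each layer adds all untouched neighbors of the current layer and that the final layer is truncated to reach n_s exactly (the helper name is ours):

```python
import numpy as np

def snowball_sample(A, n_s, rng):
    """Grow the sample layer by layer from a random seed node."""
    N = A.shape[0]
    sampled = np.zeros(N, dtype=bool)
    frontier = np.zeros(N, dtype=bool)
    frontier[rng.integers(N)] = True
    while sampled.sum() < n_s and frontier.any():
        sampled |= frontier
        # next layer: nodes connected to the current frontier, not yet sampled
        frontier = (A[frontier].any(axis=0) | A[:, frontier].any(axis=1)) & ~sampled
    idx = np.where(sampled)[0]
    return idx[:n_s]   # truncate the last layer to the target size
```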
(3) MHRW: MHRW, which is based on random-walk sampling, is another network sampling method that collects nodes adaptively by means of link tracing [32,41]. The MHRW sample can be obtained as follows: (i) randomly select a node i, with G_y^1 = {i}; (ii) from the nodes j connected to the current node in G_y^t, MHRW adaptively selects the next node with unequal probability p_j; repeat until the sample size reaches n_s.
Next, we consider the generation mechanism of the simulated networks. (1) The first example is a dyad-based network, where a dyad is defined as A_ij = (a_ij, a_ji) for the ith and jth nodes, and the dyad probabilities are set accordingly. (2) The second example is the stochastic block model [38,27]. We define K = 100 as the total number of blocks. Then, we set P(a_ij = 1) = 0.9N^{-1} if the ith and jth nodes are in the same block, and P(a_ij = 1) = 0.3N^{-1} otherwise. (3) The third example is the power-law distribution network [10]. The in-degree of node i is denoted as d_i (i.e., d_i = Σ_{j=1}^N a_ji). Moreover, d_i follows the discrete power-law distribution P(d_i = d) = c_n d^{-c_e}, where c_n is a normalizing constant. We set the exponent parameter c_e = 2.5.
Finally, we consider the generation mechanism of the noise matrix E and the exogenous covariate matrix X. We set p = 2, q = 2, α = (0.3, 0; −0.2, 0.1), and β = (−0.5, 1.3, 1, 0.3)^⊤. For the noise matrix, each row E_i• (i = 1, ..., N) is generated independently from: (1) a multivariate normal distribution with mean (0, 0)^⊤ and covariance Σ_e = (0.4, 0.1; 0.1, 0.6) ∈ R^{2×2}; or (2) a t-distribution with the same mean and covariance as in (1) and 5 degrees of freedom. For the exogenous covariates of node i, X_i• is generated from a multivariate normal distribution with mean (0, 0)^⊤ and covariance Σ_x = (0.5^{|j1−j2|}) ∈ R^{2×2}.
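The second and third network examples can be generated as in the sketch below, using the probabilities stated above; the concrete construction of the power-law network (assigning each node a sampled in-degree and choosing its followers at random) is our assumption about the mechanism, not a specification from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
N, K = 2000, 100

# Example (2): stochastic block model with K = 100 blocks.
block = rng.integers(K, size=N)
same = block[:, None] == block[None, :]
P = np.where(same, 0.9 / N, 0.3 / N)
A_sbm = (rng.random((N, N)) < P).astype(float)
np.fill_diagonal(A_sbm, 0)

# Example (3): power-law in-degrees, P(d) proportional to d^{-2.5}.
c_e = 2.5
support = np.arange(1, N)
probs = support.astype(float) ** (-c_e)
probs /= probs.sum()                       # normalizing constant c_n
d_in = rng.choice(support, size=N, p=probs)
A_pl = np.zeros((N, N))
for i in range(N):
    # node i receives d_in[i] edges from distinct randomly chosen other nodes
    senders = rng.choice(np.delete(np.arange(N), i), size=d_in[i], replace=False)
    A_pl[senders, i] = 1
```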

Performance measurements and simulation results
We compare the different sampling methods in terms of the bias and efficiency of the estimates. Consider the overall network size N = 2000 and different sample sizes (n_s = 200, 500). In each simulation example, we repeat the experiment R = 500 times to compare the performance. For the true parameter α_jk (1 ≤ j, k ≤ p), the corresponding estimator is recorded as α̂_jk^(r), where r represents the rth replication (r = 1, ..., R). For the parameter α_jk, the average bias is computed as R^{-1} Σ_{r=1}^R (α̂_jk^(r) − α_jk). The root mean square error (RMSE) is calculated as {R^{-1} Σ_{r=1}^R (α̂_jk^(r) − α_jk)²}^{1/2}. Further, SE_jk² is denoted as the {(j − 1)p + k}th diagonal element of the asymptotic covariance matrix in (2.7). Then, the empirical 95% confidence interval for α_jk is constructed as CI_jk^(r) = (α̂_jk^(r) − z_{0.975} SE_jk^(r), α̂_jk^(r) + z_{0.975} SE_jk^(r)), where z_α is the αth lower quantile of a standard normal distribution. As a result, the empirical coverage probability is determined as R^{-1} Σ_{r=1}^R I(α_jk ∈ CI_jk^(r)). Moreover, the average running time over the 500 replications is reported.
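Given the R replications, the reported measures can be computed as in the following sketch (the array names are ours); z_{0.975} is hard-coded.

```python
import numpy as np

Z975 = 1.959964  # z_{0.975}, the upper 2.5% quantile of the standard normal

def summarize(est, true, se):
    """Bias, RMSE, and empirical coverage of 95% intervals.
    est, se: (R, ...) arrays of estimates and standard errors; true: (...) array."""
    bias = est.mean(axis=0) - true
    rmse = np.sqrt(((est - true) ** 2).mean(axis=0))
    lo, hi = est - Z975 * se, est + Z975 * se
    coverage = ((lo <= true) & (true <= hi)).mean(axis=0)
    return bias, rmse, coverage
```

For a well-calibrated estimator, the empirical coverage should stabilize around the nominal 95% level as R grows.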
The simulation results are obtained using the four sampling methods. Tables 1 to 2 report the simulation results with 500 replications for examples 1 to 3, with error terms following the normal distribution. As the simulation results are similar for the t-distribution, Tables 4 to 5 present only the results for the CS method under the different examples.
First, by using the CS method, without the entire network data, we need only n_s = 200 for the estimates to perform effectively. Secondly, in each simulation model, the maximum absolute bias of the CS method is 0.015, which is consistent with the CS method yielding unbiased estimation. As n_s increases, the RMSE of each parameter decreases. Moreover, the empirical coverage probabilities of the CS estimators are stable around the level of 95%, which implies that the estimated standard error effectively approximates the true standard error SE_jk. In contrast, for the other three traditional sampling methods, the diagonal elements of α are substantially underestimated, and the empirical coverage probabilities are far from the level of 95%. Thus, we conclude that it is necessary to consider the related information among the sampled nodes G_y (i.e., the double-layer supplementary information). Moreover, when facing a large-scale network with limited storage and computational resources, our method can obtain consistent estimates on the sampled data.

Third-party restaurant consumer review website data analysis
In this section, we present the CS implementation for the MSAR model on third-party restaurant consumer review website data. On the website, users can comment on the merchants that they visited. We regard the online user behavior features (namely, activity and interaction) as responses. The user activity represents the number of a user's comments on merchants over the past month. The user interaction represents the communication among users, which is measured by the number of likes and comments between users. The log-transformed values of the activity and interaction are aggregated, following which the responses are normalized with mean 0 and variance 1. The histograms of the log-transformed activity and interaction are presented in Figure 4. We collect three user features as the exogenous nodal covariates: the registration time, VIP status, and gender (gender = 1 for male and 0 for female). The histograms of the log-transformed registration time are provided in Figure 4. Moreover, 68.4% of users are VIP (VIP = 1 for VIP users) and 98.1% of users are male. All the continuous variables are standardized with mean 0 and variance 1. Subsequently, for the network structure, we consider the adjacency matrix A constructed based on the user comments on merchants. If two users i_1, i_2 have reviewed the same merchant, a_{i1i2} = 1 is recorded, and a_{i1i2} = 0 otherwise. The histograms of the degrees are presented in Figure 4. The nodes with degrees higher than 500 are deleted. Moreover, 5534 users are involved in the real data analysis.
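The co-review adjacency construction can be sketched as follows, assuming a binary user-by-merchant review matrix is available (`coreview_adjacency` and its input format are hypothetical, not from the paper):

```python
import numpy as np

def coreview_adjacency(reviews, max_degree=500):
    """reviews: binary (n_users x n_merchants) matrix; a_{i1i2} = 1 if users
    i1 and i2 reviewed at least one common merchant."""
    common = reviews @ reviews.T          # counts of shared merchants per user pair
    A = (common > 0).astype(float)
    np.fill_diagonal(A, 0)
    # drop nodes with degree higher than max_degree, as described in the text
    keep = A.sum(axis=1) <= max_degree
    return A[np.ix_(keep, keep)], keep
```

Note that an adjacency matrix built this way is symmetric, since sharing a reviewed merchant is an undirected relationship.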
We treat the above N = 5534 nodes and corresponding edges as if they were the entire network. Furthermore, we regard the least-squares estimates on the entire network as the true parameters. The estimation results on the entire network are summarized in Table 6. The corresponding p-values are all smaller than 0.01. According to Table 6, the user activity and interaction have positive intra-activity and extra-activity effects. For the exogenous nodal covariates, VIP users and male users exhibit significantly higher activity and interaction. Moreover, it is found that users with an earlier registration time have higher activity but lower interaction.
Thereafter, we use only a small number of nodes to approximate the true parameters. We set the sample size to 300 and 500. For CS, we take the average of the estimates obtained over R = 200 replications. The estimation results are summarized in Table 6. We consider the bias, RMSE, and empirical rejection probability (ERP) as the performance measures. We set the significance level to α = 0.01. The ERP for α_{j1j2} is computed as R^{-1} Σ_{r=1}^R I(|α̂_{j1j2}^(r)| / SE_{j1j2}^(r) > z_{1−α/2}). The ERP represents the empirical size or power, depending on whether the true α_{j1j2} is 0 or not. Note that the three traditional sampling methods perform poorly in terms of the bias and ERP. Thus, we focus on the CS results.
We compare the estimates obtained from the sampled data with the true parameters. In Table 6, we abbreviate the activity, interaction, and registration time as ACT, INT, and RET, respectively. CS performs effectively with a finite sample size. First, we regard the estimates as nearly unbiased because the maximum absolute bias is 0.003. Moreover, as the sample size increases, the RMSE of all parameters decreases. Secondly, for significant true parameters, the ERP approaches 100% on the sampled data when n_s = 500. Consequently, CS can lead to consistent estimation on the sampled data, and the double-layer supplementary information from the sampled nodes is necessary for consistent estimation. Furthermore, compared with the neighbors' interaction, the neighbors' activity has a stronger impact on the users' activity and interaction. We suggest rewarding users and encouraging them to leave more reviews on merchants to make the website more active.

Concluding remarks
In this study, to approximate the estimation for MSAR on the entire network accurately, we have proposed the CS method for the LSE estimator in a general framework. Together with the sampled nodal features and double-layer supplementary information, we constructed the least-squares objective function on the sampled data. The identification condition for the parameters on the sampled network was theoretically investigated. Furthermore, the sample size condition and asymptotic properties of the least-squares estimators were provided. Numerical results for the simulated data and real data were presented.
In conclusion, we consider three interesting topics for future studies. First, from a model-based viewpoint, we have focused on the sampling scheme for the MSAR model. Numerous other models, such as the exponential random graph model and the network vector autoregressive model, could also be used to analyze network data. Thus, extending the CS method to estimate the parameters in these models could be an interesting research problem. Secondly, for the parameters in different network models and the properties of the network, the bootstrap estimator of the parameters deserves a separate study; see [35,15,8] for further discussions. Thirdly, we have regarded the network structure and nodal features as fixed during the observation period. However, in practice, the network structure, many responses, and exogenous covariates are observed in time series. Thus, a dynamic sampling mechanism and corresponding estimators could be designed.

Appendix A
In Appendix A, we present certain notations and verify the least-squares objective function (2.4) in Appendix A.1. Next, the proof of Proposition 1 is presented in Appendix A.2. In this section, we first provide the notations for several useful matrices and vectors. The MSAR model on the entire network can be rewritten in vector form as Y = (α ⊗ W)Y + Xβ + E, where X = I_p ⊗ X and, with a slight abuse of notation, Y = vec(Y) ∈ R^{Np}. The vector form can also be expressed as Y = S^{-1}(Xβ + E), where S = I_{Np} − α ⊗ W. Consider a matrix R with n_1 rows and n_2 columns. Here, R_ij represents the element in the ith row and jth column of R, R_{i,•} ∈ R^{n_2} is the ith row of R, and R_{•,j} ∈ R^{n_1} is the jth column of R. We define R_{i,(−i)} ∈ R^{n_2−1} as the ith row of R without the ith element. Similarly, we define R_{(−i),i} ∈ R^{n_1−1} as the ith column of R without the ith element, and R_{(−i),(−i)} ∈ R^{(n_1−1)×(n_2−1)} as R without the ith row and ith column. Furthermore, with respect to α_{j1j2} and β, the first-order derivatives of R are written as R_{α_{j1j2}} and R_β. We denote R_{α_{j1j2}α_{k1k2}}, R_{α_{j1j2}β}, and R_{ββ} to represent the second-order derivatives ∂²R/∂α_{j1j2}∂α_{k1k2}, ∂²R/∂α_{j1j2}∂β, and ∂²R/∂β∂β^⊤, respectively.
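The vector form relies on the column-major identity vec(W Y α^⊤) = (α ⊗ W) vec(Y), a special case of vec(ABC) = (C^⊤ ⊗ A) vec(B); a quick numerical check:

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 6, 2
W = rng.normal(size=(N, N))
Y = rng.normal(size=(N, p))
alpha = rng.normal(size=(p, p))

lhs = (W @ Y @ alpha.T).flatten(order="F")      # vec(W Y alpha^T), column-major
rhs = np.kron(alpha, W) @ Y.flatten(order="F")  # (alpha kron W) vec(Y)
assert np.allclose(lhs, rhs)
```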

A.1. Notations and Verification of (2.4)
Next, we present the form of the BLP of the element Y_ij. When the error term E_{•j} follows a normal distribution, the BLP of Y_ij given Y_{−(i,j)} has the same form as the conditional mean [31]. Let n_ij = (j − 1) × N + i; then, the BLP of Y_ij can be expressed as in (A.1). Moreover, the precision matrix of Y can be expressed as Ω = Ω_e ⊗ I_N − O − O^⊤ + Q, where O = (Ω_e α) ⊗ W and Q = (α^⊤ Ω_e α) ⊗ (W^⊤ W). Next, we split the product of vectors in (A.1) (i.e., Ω_{n_ij,(−n_ij)}(Y_{(−n_ij)} − E{Y_{(−n_ij)}})) into the sum of the products of its components. Thus, the calculation of each element in r_{i1j1i2j2} only involves the three types of unsampled nodes in Figure 3. For any node i_1 ∈ G_y, since only nodes i_2 ∈ G_y ∪ G_x are involved in (A.1), we can obtain the BLP of Y_{i1j1} in (2.2)-(2.3). As a result, the objective function with all N nodes can be expressed as L_w(θ) = ||Y − μ_c||². Instead of using all N nodes to establish the objective function, we construct the objective function L on the sampled data, whose form in (2.4) follows accordingly. The computational complexity of each nonzero element in μ_c is O(1) according to (A.2). Then, there are only n_s p nonzero elements in H. Similarly, the computational complexity of the objective function L and its derivatives (first- and second-order) is linear in the sample size n_s. Further, we assume that the Newton-Raphson algorithm converges in a finite number of steps. As a result, the computational complexity is linear in the number of sampled nodes n_s. Thus, Theorem 1 is proven.

B.2. Proof of Theorem 2
Proof. In this section, we investigate the identification condition on the sampled network. In the proof of the identification, θ = (vec(α)^⊤, β^⊤)^⊤ ∈ R^{p²+pq} represents the estimates obtained by calculation, and θ_0 = (vec(α_0)^⊤, β_0^⊤)^⊤ represents the true parameter. Then, we define M_g, where m_g retains the rows and columns corresponding to the sampled nodes in m. Letting λ_min(M_g) ≥ c_M, where c_M is a positive constant, we have the following: as n_s → ∞, under condition (C1), n_s^{-1} V_g^⊤ M_g V_g = 0 if and only if θ = θ_0 on the sampled network. We provide the details as follows.
First, we can obtain V_g = (I_p ⊗ M_g){vec(α_0) − vec(α)} + (I_p ⊗ X_g)(β_0 − β). Moreover, we define v ∈ R^{Np}, Σ*_e = Σ_e / tr(Σ_e), and c_m = max{m}. We denote v_g ∈ R^{Np} with only n_s p nonzero elements, where the elements corresponding to the unsampled nodes in v_g are all equal to 0. By the assumption of Theorem 2 that λ_min(S^⊤S) ≥ c_s, we also obtain the corresponding lower bound. Moreover, there exists a finite constant c_e such that λ_max(Σ*_e) ≤ c_e.
As a result, with X*_g defined accordingly, lim_{n_s→∞} n_s^{-1} X*_g^⊤ X*_g exists and is nonsingular. Finally, we have lim_{n_s→∞} n_s^{-1} V_g^⊤ M_g V_g = 0 if and only if θ = θ_0 on the sampled network. This completes the proof.

B.3. Proof of Theorem 3
Proof. In this section, we investigate the asymptotic properties of the least-squares estimator, denoted by θ̂, in two parts. In PART 1, θ̂ is proven to be √n_s-consistent. In PART 2, the asymptotic normality of θ̂ is proven. PART 1: Consistency. To establish the consistency result, following [13], we demonstrate that, for any ε > 0, there exists a constant C_s > 0 such that (B.1) holds. Next, we discuss the first-order derivative L̇(θ) and the second-order derivative L̈(θ) in turn. First, we need to prove that lim_{n_s→∞} n_s^{-1} cov{L̇(θ), L̇(θ)} = Σ_{1s}. We treat the right side of (B.2) as a quadratic function of C_s; its linear-term coefficient is O_p(1). Next, based on the law of large numbers, we can prove that n_s^{-1} L̈(θ) →_p Σ_{2s}. Thereafter, we can determine that λ_min{n_s^{-1} L̈(θ)} →_p λ_min(Σ_{2s}) > 0, since Σ_{2s} is a real symmetric matrix; i.e., the quadratic-term coefficient is asymptotically positive. As a result, for a sufficiently large C_s, (B.2) is positive with probability tending to 1 as n_s → ∞. Furthermore, (B.1) holds, and the consistency of θ̂ follows.
Next, we provide the details. In PART 1, Step 1, we present the proof of lim_{n_s→∞} n_s^{-1} cov{L̇(θ), L̇(θ)} = Σ^L_{1s}. In PART 1, Step 2, we present the proof of n_s^{-1} L̈(θ) →_p Σ^L_{2s}. Under condition (C6), we assume that the following six limits exist; their detailed expressions are given in (2.6). We denote by L_α, L_β the first-order derivatives of L with respect to α and β, and by L_αα, L_αβ, L_ββ the corresponding second-order derivatives.
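For reference, the consistency device of [13] used in PART 1 takes the standard form below; the exact constants are those of the paper's (B.1):

```latex
P\Big\{ \inf_{\|u\| = C_s} L\big(\theta_0 + n_s^{-1/2}\, u\big) > L(\theta_0) \Big\} \ge 1 - \epsilon .
```

This implies that, with probability at least 1 − ε, a local minimizer θ̂ lies in the ball {θ_0 + n_s^{-1/2} u : ||u|| ≤ C_s}, which yields the √n_s-consistency claimed in PART 1.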

Fig 1 .
Fig 1. Illustration of the MSAR model. Assume that node i (Alice) has two different connected nodes (Bob and Carl), and there are two responses (tweets on investment and tweets on saving). The influential variables of Alice's tweets on investment can be divided into four parts, corresponding to the four terms on the right side of (2.1). The first part consists of the tweets on investment of Alice's friends (Bob and Carl), which corresponds to the intra-activity effect; the second part consists of the tweets on saving of Alice's friends, which corresponds to the extra-activity effect; the third part consists of Alice's educational background and location, which corresponds to the own effect; and the fourth part is the noise term.

Fig 3 .
Fig 3. Three types of unsampled nodes i_2 involved in (2.3). In this case, i_1 represents the sampled nodes (G_y) and i_2 represents the supplementary nodes (G_x). The three types are (1) nodes i_1 followed by i_2 (w_{i2 i1} ≠ 0), representing i_2 ∈ G_x1 in the left panel; (2) nodes i_2 followed by i_1 (w_{i1 i2} ≠ 0), representing i_2 ∈ G_x2 in the middle panel; and (3) i_1 and i_2 indirectly followed by a third node i (Σ_{i∈G_y∪G_x} w_{i i1} w_{i i2} ≠ 0), representing i_2 ∈ G_x3 in the right panel. As a result, the three types of unsampled nodes involved in (2.3) correspond exactly to the supplementary nodes G_x1, G_x2, G_x3 collected by STEP 2 and STEP 3 in Figure 2.
and d_i is the out-degree of node i. Denote G_y^{t+1} = G_y^t ∪ {j}. (iii) Repeat (ii) until the sample size reaches n_s. We consider the following three simulation examples for the network structure; the three examples have different generation mechanisms for the adjacency matrix A. (1) The first example is the dyad independence network [19],
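Steps (ii)-(iii) above can be sketched as follows. This is a hedged illustration: the paper's selection probability involves the out-degrees d_i, whereas this stand-in draws the next node uniformly from the unsampled out-neighbors, and the function name is our assumption.

```python
import numpy as np

def crawl_sample(adj, seeds, n_s, rng=None):
    """Sketch of the crawling scheme: starting from seed nodes G_y^0,
    repeatedly add one unsampled out-neighbor j of the current sample,
    setting G_y^{t+1} = G_y^t U {j} (step ii), until |G_y| = n_s
    (step iii).  adj is a dense 0/1 adjacency matrix."""
    rng = rng or np.random.default_rng()
    G_y = set(seeds)
    while len(G_y) < n_s:
        # unsampled out-neighbors of the current sample
        frontier = {j for i in G_y
                    for j in np.flatnonzero(adj[i]) if j not in G_y}
        if not frontier:                 # crawl stuck: jump to a fresh node
            frontier = set(range(adj.shape[0])) - G_y
        G_y.add(int(rng.choice(sorted(frontier))))
    return G_y
```

Each iteration adds exactly one node, so the crawl terminates after n_s − |seeds| steps; the sampled set G_y then drives the construction of the objective function L.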

Fig 4 .
Fig 4. Descriptive analysis of the real data. Top left panel: histogram of log-activity; top right panel: histogram of log-interaction; bottom left panel: histogram of standardized log-registration time; bottom right panel: histogram of degrees.

Table 1
Simulation results with 500 replications for example 1 (dyad independence network). The error term ε_i ∼ N(0_2, Σ_e). The estimates on the entire network and average running time are abbreviated as EWN and T, respectively.

Table 2
Simulation results with 500 replications for example 2 (stochastic block network). The error term ε_i ∼ N(0_2, Σ_e). The estimates on the entire network and average running time are abbreviated as EWN and ART, respectively.

Table 3
Simulation results with 500 replications for example 3 (power-law distribution network). The error term ε_i ∼ N(0_2, Σ_e). The estimates on the entire network and average running time are abbreviated as EWN and ART, respectively.

Table 4
CS estimates with 500 replications for three examples. The error term ε_i follows t(5) with mean 0_2 and covariance Σ_e. The results for the average running time are similar to those of the former cases.

Table 5
Estimates on the entire network with 500 replications for three examples. The error term ε_i follows t(5) with mean 0_2 and covariance Σ_e.

Table 6
CS estimates with 200 replications on real data. The activity, interaction, and registration time are abbreviated as ACT, INT, and RET, respectively. The "*" indicates that the estimates on the entire network are significant at the 0.01 level. The reported estimates on the sampled real data are the means over 200 replications. The empirical powers of the estimates are reported in brackets alongside the corresponding estimates. The estimates of the intercept term are omitted.