Heterogeneity in Individual Network Analysis: Reality or Illusion?

Abstract The use of idiographic research techniques has gained popularity within psychological research and network analysis in particular. Idiographic research has been proposed as a promising avenue for future research, with differences between idiographic results highlighting evidence for radical heterogeneity. However, in the quest to address the individual in psychology, some classic statistical problems, such as those arising from sampling variation and power limitations, should not be overlooked. This article aims to determine to what extent current tools to compare idiographic networks are suited to disentangle true from illusory heterogeneity in the presence of sampling error. To this end, we investigate the performance of tools to inspect heterogeneity (visual inspection, comparison of centrality measures, investigating standard deviations of random effects, and GIMME) through simulations. Results show that power limitations hamper the validity of conclusions regarding heterogeneity and that the power required to assess heterogeneity adequately is often not realized in current research practice. Of the tools investigated, inspecting standard deviations of random effects and GIMME proved the most suited. However, all tools evaluated leave the door wide open to misinterpret all observed variability in terms of individual differences. Hence, the current paper calls for caution in the use and interpretation of new time-series techniques when it comes to heterogeneity.

Ever since Molenaar (2004) aimed to bring back the individual into scientific psychology once and for all with his classic manifesto, there has been a rise in idiographic research. This rise is mainly fueled by the realization that inter-individual (nomothetic) and intra-individual (idiographic) levels of analysis do not necessarily yield similar results-a concern that has been pointed out numerous times (Bos & Wanders, 2016;Fisher et al., 2018;Hamaker et al., 2005;Kievit et al., 2013;Schuurman et al., 2015;Simpson, 1951). Population heterogeneity is often brought forward as a reason for this lack of overlap: for instance, individuals may differ from each other not only quantitatively but qualitatively, and current research practice struggles to take these differences into account.
Over the years, network analysis has rapidly gained popularity within psychology (for an overview of the literature, see Fried et al., 2017;Robinaugh et al., 2020). In network analysis, psychological constructs are represented by nodes and edges. Nodes indicate variables that play a role in the psychological construct of interest, e.g., symptoms, where edges represent the statistical relationship between these nodes (Borsboom, 2017;Borsboom & Cramer, 2013;Cramer et al., 2010). This statistical relationship depends on the method used: often, edges represent partial correlations or (logistic) regression coefficients. The most common way to estimate a network is by applying a Gaussian Graphical Model (GGM; Lauritzen, 1996)an undirected network model with partial correlations-to cross-sectional data (Epskamp et al., 2016;Robinaugh et al., 2020). As such, the edges in these types of networks represent the strength of the statistical association between two nodes while controlling for every other node in the network. 1 data has become in favor. In this approach, a single individual is measured frequently over an extended period of time, after which a subject-specific network is estimated (Epskamp et al., 2018c). The temporal ordering of time-series data adds two challenges to network estimation: on the one hand, because consecutive time points violate independence assumptions, standard GGM estimation techniques for crosssectional data cannot be used, and on the other hand, the temporal information allows for identifying relationships over time, providing insight into Granger causality (Granger, 1969). Statistical complications arising from violations of independence can be resolved by estimating a temporal network-a network with directed edges that provides information regarding patterns among variables as they unfold over time-in addition to an undirected network containing partial correlations. This latter network is referred to as the contemporaneous network and may provide insight into patterns that occur at a time scale different from the one defined by the spacing of the measurement occasions. Especially within clinical practice, a detailed understanding of the individual and their development over time is deemed to be important, and to this end, these types of networks are seen as a promising tool (Burger et al., 2020).
Intra-individual research using network models has regularly claimed evidence for heterogeneity when comparing individual networks (e.g., Beck & Jackson, 2020;De Vos et al., 2017;Levinson et al., 2022;Piccirillo & Rodebaugh, 2022;Reeves & Fisher, 2020). A common way to analyze heterogeneity within network analysis is by estimating intra-individual network models and using tools to compare the individual network models to one another. In accordance, the observation that network models appear to show differences across individuals is often seen as a vindication of the N ¼ 1 paradigm, as it seems to support the idea that understanding intra-individual processes requires intra-individual data, so that "[I]f one wants to know what happens in a person, one must study that person" (Borsboom et al., 2003).
However, this type of research runs the risk of mistaking noise for heterogeneity by directly interpreting all observed variability in individual network structures in terms of individual differences. After all, not all variability is due to individual differences; some variability is caused by fluctuations within the data due to sampling error and variance sources unrelated to the constructs of interest. As of yet, it is unclear precisely to what extent current metrics used to detect heterogeneity within network analysis are sensitive to such sources of variance.
Therewith, the aim of this paper is twofold: we (1) demonstrate how easily noise can be mistaken for heterogeneity, and (2) we shed light on whether popular tools within network analysis can separate heterogeneity from sampling variation. First, we illustrate the influence of sampling variation when comparing individual network models by means of a thought experiment. Second, we investigate the ability of four "heterogeneity metrics" to separate real from illusionary heterogeneity by means of a simulation study. We provide the reader with a measure to calculate the expected overlap between two estimated networks and leave suggestions for future assessments of network heterogeneity.

Network heterogeneity: a thought experiment
To illustrate the influence of sampling variation within network analysis and its relation to power and heterogeneity, let us conduct a simple thought experiment. Imagine that we have two data-generating network models: One fully connected network containing five nodes and 10 edges and one empty network with the same number of nodes. 2 We refer to these networks as our true underlying networks. For a graphical representation of the two networks, see the left upper and lower panel in Figure 1. Further, assume all edges in the network are of equal strength, and our hypothetical sample is entirely homogeneous, i.e., each individual making up our sample has the same true underlying network structure. Now suppose we have a sensitivity of 50% and a specificity of 90%. That is, we have a probability of 50% to detect an edge that is truly present within our network, and we have a probability of 90% to reject an edge that is not truly present in our network. These numbers are what can be expected when t ¼ 50 given previous simulation studies on individual network models (Epskamp et al., 2018c;Mansueto et al., 2022).
With these assumptions in hand, let us focus on the fully connected network first. Given a sensitivity of 50%, we can expect to pick up, on average, five out of the 10 true edges. Assuming all edges are equal in strength, roughly 50% of the edges will show up in our generated network model at random. Now the question is, if we randomly select three individuals 2 For simplicity, we will use undirected contemporaneous networks and ignore temporal effects by assuming temporal networks are empty. The thought experiment could easily be extended to temporal networks as well, as the same logic holds. from our homogeneous sample, how much overlap between these three individual network models do we expect to see?
When visually inspecting the individual networks (upper panel Figure 1), one may be tempted to conclude there are plenty of individual differences, as none of the resulting networks look similar to one another. To calculate the overlap between the estimated edges, we could take the number of shared edges between two networks, divided by the number of possible edges (nðn À 1Þ=2, in which n stands for the number of nodes). Multiply by 100 to convert the ratio to a percentage. Using this as a metric to calculate the resemblance between two individual networks, the overlap between network A and B is 20%, the overlap between network A and C is 10%, and between B and C is 40%. If we were to repeat this scenario, on average, we expect to see an overlap of 25% between two randomly selected estimated individual networks, even though the true underlying data generating network structure is exactly the same.
Moving on to the case where our true underlying network is empty, we see even less of an overlap between the generated individual network models (see lower panel Figure 1). Remember, we operate with a specificity of 90%. This implies we have a 10% chance of finding an edge that is absent in reality. The fact that there are 10 potential edges means we expect, on average, one false edge to be included in each estimated network model. Here again, assuming all edges are equal in strength, this erroneous edge can show up anywhere in our network. Once more, let us generate a network for three individuals, see the lower panel Figure 1. When visually inspecting these networks, again, our initial conclusion would be that there is a great deal of variety in the resulting networks. If we were to calculate the overlap between the networks, as we did for our fully connected case, we would now find an overlap between our three data-generating network models containing five nodes, with, in the upper panel 10 edges, i.e., a fully connected network, or, in the lower panel, zero edges, i.e., an empty network. We refer to these two networks as the true data-generating networks. For both true underlying network structures, three network structures were randomly generated, yielding a sensitivity of 50% and a specificity of 90%. For simplicity, edge weights are of equal strength. Solely based on a visual comparison of the three randomly generated network structures, not knowing the true underlying network, one may be tempted to conclude that these three networks are very different, i.e., there is a great amount of heterogeneity present. In fact, all of this perceived "heterogeneity" is caused by sampling variation. (a) Fully connected case and (b) Empty case. networks of 0%. If we were to repeat this scenario, on average, we expect to see an overlap of 10% between any two random individuals.
Analytically, we can derive the probability of obtaining the same network twice in both conditions of our thought experiment (i.e., the true fully connected case and the true empty case) to be: in which n is the number of nodes, k is the total number of possible edges (i.e., nðn À 1Þ=2), and p represents the sensitivity in the fully connected network case and 1 À specificity in the empty network case.
Using this expression, we can calculate that the probability of obtaining the exact same network twice in the thought experiment where the true network is a fully connected, with n ¼ 5 nodes and sensitivity of p ¼ 0.5 to be less than 0.1%. Likewise, we can see that in the case of an empty network with a specificity of 0.9 (i.e., p ¼ 0.1) the probability of obtaining two identical networks is 13.7%. This thought experiment shows that we can find a great deal of variety in individual networks when visually comparing them, despite our true underlying network model being invariant over individuals. This exposes how prone visual inspection of network models is to error. In addition, often only the number of present edges is taken into account in order to determine overlap between two networks, but if we were to, for example, base the overlap between two networks not only on the estimated edges but also on the absent edges, the overlap between the networks in our fully connected case would rise to 50% and in the empty case to an astonishing 90%. Given our homogeneous sample, another way to increase the estimated overlap between two individual network models could be established by increasing sensitivity. Sensitivity (i.e., the power to detect an existing edge) increases as the number of time points increases. Suppose we increase our sensitivity to 90%. In the case of a fully connected network, this means that if we were to draw two random networks to compare, we would expect an average overlap of 80%, and the chance of two randomly chosen estimated network models to be exactly alike would be 13.7%.
In order to increase power without inflating the Type-I error rate-i.e., increasing sensitivity while keeping specificity high-there is only one solution: we need to collect more data. But how much more data? Much work has been dedicated to identifying the minimum amount of data required when performing ideograph analysis, (e.g., Epskamp et al., 2018c;Lane & Gates, 2017;Mansueto et al., 2022;Nestler & Humberg, 2021), 3 however, a considerable variety of results has been reported, likely as a function of simulation and estimation details as well as variation in the network structure simulated from, and were not always viewed in light of perceived heterogeneity. Therefore, it is unclear how much data is needed in order to achieve the preferred sensitivity to make a valid claim about heterogeneity in individual network analysis.
The scenario sketched in our thought experiment is merely a hypothetical one. In reality, we deal with various edge strengths and differences in network structures, such as a more sparse or a more densely connected network structure. There is a delicate interplay between edge weights, network structure, properties of the data (such as sample size and effect size), and the estimation technique used. To shed light on this interplay, we turn to a simulation study to determine the effects of sampling variation on detecting heterogeneity in individual networks under different network structures using three popular estimation techniques.

Beyond thought experiments: Network heterogeneity in estimated network structures
To further illustrate the main point of this paper-be careful with the interpretation of all variability as evidence for individual differences-we conducted a simulation study in which we applied three idiographic network estimation techniques (graphical VAR (graphicalVAR), multilevel VAR (mlVAR) and Group Iterative Multiple Model Estimation (GIMME)) to estimate individual network structures from simulated homogeneous data. This simulation study aims to answer the question to what extent current network-based tools to detect heterogeneity yield valid results to determine the amount of heterogeneity present. Throughout this paper, we have defined the concept of heterogeneity to mean the absence of homogeneity. This is a strong stance on heterogeneity, as the slightest difference is taken as a sign of heterogeneity. Within the network literature, the operationalization of heterogeneity differs. Therefore, it is 3 The work of Epskamp et al. (2018c) shows t ¼ 100 is sufficient when performing a grpahicalVAR analysis, while the work of Mansueto et al. (2022) shows sufficient sensitivity when t ¼ 500. In addition, Lane and Gates (2017), show t > 60 is suitable for GIMME to pick up small to moderate effects while the work of Nestler and Humberg (2021) shows t > 100 for GIMME to perform well. crucial to have an understanding of different individual network modeling techniques that can be used to operationalize heterogeneity in order to explore if and how these methods account for normal fluctuations within the data. To this end, we provide a brief overview of the most commonly used idiographic network estimation tools.

An overview of idiographic network estimation tools
The most common way to estimate individual networks is with some type of Vector Autoregression (VAR) model. There are two modeling frameworks that extend the VAR model to incorporate both temporal and contemporaneous network structures: the graphical VAR model (GVAR; Epskamp et al., 2018c) and the structural VAR model (SVAR; G. Chen et al., 2011). The GVAR model represents the contemporaneous network using undirected effects, whereas the SVAR model represents the contemporaneous network using direct effects. Several methods exist for estimating model parameters (edge weights) coupled with model structure (presence or absence of edges) for both modeling frameworks and for N ¼ 1 and N > 1 datasets.
In N ¼ 1 settings, the GVAR model can be estimated through iterative regularized estimation by using the multivariate regression with covariance estimation algorithm (MRCE; Abegaz & Wit, 2013;Guo et al., 2010), which is implemented in the R packages graphicalVAR and sparseTSCGM, or it can be estimated through maximum likelihood estimation as implemented in the psychonetrics R package (Epskamp, 2020a(Epskamp, , 2020b). The SVAR model can be estimated in the N ¼ 1 setting through model search using generic structural equation modeling software such as the R package lavaan (Rosseel, 2012). In the N > 1 setting, each of the N ¼ 1 methods can be used separately for each individual, a practice investigated in this paper.
In addition to estimating a network model for each individual separately, methods exist that allow one to borrow information across participants. In particular multi-level estimation is often used with the GVAR model by using the two-step multi-level GVAR algorithm as implemented in the R package mlVAR (Epskamp et al., 2018c;, or Bayesian estimation as implemented in Mplus version 8 and higher (Schultzberg & Muth en, 2018). The SVAR model is often estimated in N > 1 settings using GIMME (Gates & Molenaar, 2012), which is implemented in the R package gimme (Lane et al., 2020). For details on the modeling frameworks used in this simulation study and the estimation techniques used, see Supplement A.
In this paper, we focus on three methods for GVAR and SVAR estimation: MRCE using graphicalVAR, two-step multi-level estimation using mlVAR, and Group Iterative Multiple Model Estimation using gimme. We focus on these methods because they have been used for the purpose of estimating individual networks and detecting heterogeneity in existing research (e.g., Beck & Jackson, 2020;Beltz et al., 2016;Bringmann et al., 2013;Reeves & Fisher, 2020;Rodriguez et al., 2022). To simplify the description of results below, we will refer to each of these three methods by referring to their corresponding R packages: graphicalVAR, 4 mlVAR and GIMME. We expect results from other methods, such as maximum likelihood estimation of N ¼ 1 GVAR models or Bayesian multi-level estimation of N > 1 models, to align, as these methods perform similarly in estimating network structures from data (e.g., Mansueto et al., 2022 shows a strong overlap between the graphicalVAR and psychonetrics packages). In the following sections, we will discuss these three methods, the tools used to detect heterogeneity, and examples of their use in practice for each method separately.

Regularized estimation using graphicalVAR
The graphicalVAR package uses the MRCE algorithm to estimate the GVAR model (Epskamp, 2020a). This algorithm makes use of LASSO regularization (Tibshirani, 1996), which iteratively estimates temporal coefficients (regression weights between t À 1 and t) through regularized regression, and contemporaneous coefficients (partial correlations after controlling for temporal effects) through the graphical LASSO algorithm (Friedman et al., 2008). The algorithm utilizes two LASSO penalty parameters, which are chosen by optimizing the extended BIC (EBIC; J. Chen & Chen, 2008;Abegaz & Wit, 2013;Epskamp et al., 2018c).
It is important to note that as of yet there are no techniques available that are developed specifically to detect heterogeneity using graphicalVAR estimates. The detection of heterogeneity after estimating a GVAR model through graphicalVAR is currently mainly based on (1) differences in network topology, (2) differences in network density, and (3) differences 4 GraphicalVAR therefore refers to the MRCE method implemented in the graphicalVAR R package, not to the GVAR modeling framework, which can also be estimated using other methods. in node connectivity measures (i.e., measures that are often used to assess the relative importance of nodes within the network 5 ).
De Vos et al. (2017) can be taken as an example of how visual inspection of individual network models is used to detect heterogeneity. De Vos et al. (2017) estimated individual network models for people diagnosed with Major Depressive Disorder (MDD) and healthy controls. Participants completed a questionnaire assessing seven positive and negative affect items (e.g., "Feeling cheerful" and "Feeling irritated") three times a day over a period of 30 days. Using visual inspection De Vos et al. concluded that individual networks did not resemble each other in terms of density and topology, taking this as an indication for a strong level of heterogeneity. Fisher et al. (2017) took a somewhat similar approach. The authors estimated individual network models for people with Generalized Anxiety Disorder (GAD) and MDD and for individuals who presented a comorbid clinical picture of both GAD and MDD. Forty participants completed questions on anxiety and depression symptoms (e.g., "Feeling hopeless," "Loss of interest or pleasure"), positive and negative affect, rumination, behavioral avoidance, and reassurance seeking four times a day for at least 30 days. In addition to examining the network topology, Fisher et al. examined strength centrality. Results showed that node centrality metrics differed strongly between individuals. The authors interpret these differences as an indication for heterogeneity among individuals with GAD, MDD, or a comorbid presentation of GAD and MDD. The two studies described here should be taken merely as examplementary. Given the importance of the question of whether individuals differ from one another (and from the between network structure), more studies with similar interests and set-up can easily be found. For some more recent examples see Jongeneel et al. (2020), Levinson et al. (2022), and Rodriguez et al. (2022).

Multilevel estimation using mlVAR
The second method we discuss is two-step multi-level GVAR estimation using the mlVAR package , which is based on the multilevel VAR models proposed by Bringmann et al. (2013). This package first estimates temporal coefficients by performing a series of multi-level node-wise regressions between time points t À 1 and t. Withinperson centering of predictors is used in each of the regressions, and person-wise means are included as between-person predictors, such that within-and between-person effects can be separated (Hamaker & Grasman, 2015). This separation also leads to between-person effects, that are gathered in a between-persons network, which we do not use in the present paper. Next, in a second step, another series of node-wise regressions are performed on the residuals of the first series of node-wise regressions, leading to estimates of contemporaneous networks.
As such, the mlVAR package leads to the estimation of three network structures: (a) a contemporaneous network (per subject and fixed-effect structures over all subjects), (b) a temporal network (per subject and fixed-effect structures over all subjects), (c) a between-subjects network (fixed-effects only). In addition, the standard deviations of random effects across the population on the temporal and contemporaneous network parameters are also returned and can be visualized as networks, leading to two more networks: (d) a temporal network of random effects and (e) a network of the standard deviations from the random effects.
In contrast to fully idiographic N ¼ 1 estimation, for example, by using the graphicalVAR package, multilevel estimation offers a systematic approach to detect heterogeneity. Because of the multilevel structure of the model, one can inspect the network of standard deviations of random effects. These standard deviations show the degree to which network parameters exhibit individual differences. Bringmann et al. (2013) recommend a cutoff of one standard deviation for the resulting edge weights to represent "large inter-individual differences." This means that every edge with a weight above one standard deviation can be seen as truly heterogeneous. In addition, individual differences can also be observed by visually inspecting personalized networks obtained by adding the fixed and random effects.
Bringmann et al. applied multilevel VAR, on which the mlVAR package is based, to ESM data from 129 participants with depressive symptoms. Participants' mood was assessed by a questionnaire containing six mood variables (e.g., "Cheerful" and "Fearful") 10 times a day over a period of 6 days. Bringmann et al. inferred a network of inter-individual differences by examining standard deviations of the random effects. Results indicated a high level of individual variability on the self-loop for worry, as on the self-loops for cheerful, sad, relaxed, and fearful. Furthermore, variability in the relations between cheerful and relaxed, cheerful and sad, and fearful and worry was found, from which the authors concluded that a fair amount of heterogeneity in mood between the participants with depressive symptoms was present. As opposed to tools used to assess heterogeneity for graphical VAR models, the tool used to assess heterogeneity for multilevelVAR-inspecting standard deviations from the random effects-operationalizes heterogeneity in terms of edge weight differences. 6 Group iterative multiple model estimation (GIMME) The third method we evaluate is GIMME (Gates & Molenaar, 2012;Lane et al., 2020), as implemented in the gimme package, which is a method for estimating SVAR models in N > 1 datasets (Beltz & Gates, 2017;Lane & Gates, 2017). GIMME makes use of stepwise model search strategies through structural equation models, for example, by utilizing the lavaan package for R. Similar to multilevel modeling, GIMME aims to combine the worlds of nomothetic and idiographic research. However, where multi-level modeling aims to search for quantitative similarity across people (what is the edge-weight?), GIMME aims to search for qualitative similarity across people (which edges are included?). GIMME does this by creating personspecific as well as group-level edges. Group models are estimated by an iterative search procedure that identifies temporal and contemporaneous relationships that would significantly improve model fit for most individuals. After obtaining group-level effects, GIMME will then search for temporal and contemporaneous individual-level effects, continuing this process until an optimal fit is ensured. In addition to individual model output, this leads to a model containing both temporal and contemporaneous grouplevel edges and temporal and contemporaneous individual-level edges, where the width of the edge corresponds to the number of individuals this edge was estimated for.
The detection of heterogeneity can be executed by inspecting the number of individual paths. Grouplevel paths reflect homogeneity, as these edges need to be present in at least 75% of the sample, whereas individual paths reflect heterogeneity (Beltz et al., 2016;Beltz & Gates, 2017). This means heterogeneity is operationalized in terms of the number of individual level edges, i.e., edges that did not improve model fit for most individuals but did improve model fit so for specific individuals. Beltz et al. (2016) applied GIMME to data from 25 individuals with personality pathology. Participants were instructed to complete an online survey on related clinically relevant behaviors for 100 consecutive evenings. Sixteen items from the daily surveys concerning behavioral manifestations of personality disorders were used to take four personality facets into account: negative affect, detachment, disinhibition, and hostility. Results revealed some group-level contemporaneous relations indicating some homogeneity, however, weights for these relations differed across participants. In addition, participants showed different relations on the individual level, reflecting heterogeneity.
In sum, several tools are used to assess the amount of heterogeneity for individual network models. However, the question remains how suitable these metrics are to reveal the prevalence of heterogeneity when sampling variation is taken into account. Our ability to disentangle sampling variation from the true effects in our data, i.e., power, is closely related to our ability to detect heterogeneity. A question that arises is whether, with current tools, we have enough power to separate sampling variation from true heterogeneity. To what degree can we be confident that we are looking at heterogeneity and not just normal fluctuations in our data that arise as a function of sampling variance? We investigate this question in simulation studies reported below.

Methods
Network structures. Homogeneous data were generated based on three network structures: a syntheticdata network structure, a more sparse network structure estimated from data of one clinical patient, and a dense network structure estimated from data of multiple patients. We will refer to the first network structure as the synthetic-data network, the second as the case-data network and the third as the Geschwinddata network. 7 The synthetic-data network structure is a sparse chain graph, e.g., 1-2, 2-3, etc., containing eight nodes, with a network density (the number of present edges divided by the number of possible edges) of 29% for the contemporaneous network and 14% for the temporal network. The average absolute edge weights are M ¼ 0:34 and M ¼ 0:32 for the contemporaneous and temporal effects respectively. For the data generating PDC and PCC matrices see 6 Which is indirectly related to network topology and density. 7 As these data generating matrixes are based on GVAR models, while GIMME operates under a SVAR model, we have included an additional SVAR network structure and preformed the simulation set up as described in this section. For more details on the SVAR network structure, simulaiton set up and results see Supplement B.
Supplementary material B. 8 We included this synthetic-data network model as previous simulation studies have shown that network estimation works well under this structure (Epskamp et al., 2018c). For a graphical representation of the synthetic-data network structure, see Figure 2 panel (a).
In an attempt to create networks that approximate reality, the other two network structures that were used as data-generating structures are estimated from clinical data. The case-data network is estimated from data of one clinical patient (t ¼ 47) measured over a period of two weeks. The original network was estimated using graphicalVAR and contains seven nodes about the patient's mood such as "Relaxed," "Sad" and "Nervous." For more information on the data, see Epskamp et al. (2018a). 9 The density of the network is 48% and 10%, with an absolute average edge weight of M ¼ 0:11 and M ¼ 0:16 for the contemporaneous and temporal networks respectively. For the data generating PDC and PCC matrices, see Supplementary material B; for a graphical representation of the casedata network, see Figure 2, panel (b).
The Geschwind-data network is estimated from data of multiple patients (n ¼ 129) measured over a period of six days (mean t ¼ 60). For more information on the data set, see Geschwind et al. (2011). 10 The original network was estimated using mlVAR and contains six nodes such as "Cheerful," "Sad," and "Relaxed." For this simulation study, the average contemporaneous and temporal networks were taken as the network structure of one subject. The density of the contemporaneous network and temporal network are 62% and 63%, with an average absolute edge weight of M ¼ 0:16 and M ¼ 0:06 for the contemporaneous and temporal effects respectively. For the data generating PDC and PCC matrices, see Supplementary material B and for a graphical representation of the Geschwind-data network, see Figure 2, panel (c).
Simulation procedure. Taking these three underlying network structures as the true data generating model for each individual, we simulated homogeneous data in which any apparent individual differences are due to sampling variation. Details on the parameter values under which we simulated can be found in supplementary material B. In addition to network structure, we varied the number of participants, n 2 {50, 100, 200}, and the number of time points, t 2 {50, 100, 200, 400}, These time points were chosen to represent plausible values, t 2 {50, 100}, potentially ideal values, t 2 {200, 400}, to include scenarios in the simulation where the methods should be expected to function well. We used three different estimation routines (grapicalVAR, mlVAR, and GIMME) to estimate the resulting network models, in total, creating a 3 Â 3 Â 4 design. Each condition was repeated 100 times.
In line with common practice, in order to determine the amount of heterogeneity present in estimated network models through graphicalVAR, we evaluated topology of the resulting networks by visually inspecting three randomly chosen estimated network structures. For these individuals, we compared network density, i.e., the number of actual edges divided by the number of possible edges. Furthermore, across the entire sample, we computed the node centrality measure strength for each individual network. Strength centrality is defined as the sum of the edge weights of a given node (in absolute value). For temporal networks, strength is divided into in-strength, the sum of absolute incoming edge-weights, and outstrength, the sum of absolute outgoing edge weights. Centrality measures have been taken as an indication of the importance of individual nodes in a network (Costantini et al., 2015;Opsahl et al., 2010). To determine the amount of resemblance in strength measures across individuals, we correlated the centrality measures for each possible pair of individuals. The distribution of these correlations gives us an idea of the spread of the strength of the association between centrality measures. If centrality measures are in fact a suitable measure for separating sampling variation from heterogeneity, we would expect a narrow distribution peaked around a strong positive correlation, because all samples are drawn from completely homogeneous populations.
For mlVAR estimates, we inspected the distribution of standard deviations of estimated random effects, both for contemporaneous and temporal networks. In line with Bringmann et al. (2013) we used a cutoff score of one standard deviation. Edges from the standard deviation network with a weight above this cutoff are taken to represent" large" heterogeneous effects, whereas edges below this cutoff are taken to be sampling error. In order to compare edge weights estimations from the different network structures, 8 In addition to a synthetic-data network structure with 8 nodes, a sparse chain graph containing 16 nodes was generated in similar fashion to inspect the effect of the size of the network. Results for this large network structure can be found in the Supplementary Material B. 9 The dataset used for generating the network used in the current simulation study can be found in the supplementary materials of Epskamp et al. (2018a). 10 The dataset used for generating the network used in the current simulation study can be found in the supplementary materials of  Contemporaneous and temporal data generating network models for each individual. The upper panel shows the synthetic-data network with eight nodes simulated to be a chain graph, i.e., 1-2, 2-3, etc. The middle panel shows the case-data network containing seven nodes estimated from clinical data of one subject measured over time. The lower panel shows the Geschwind-data network containing six nodes, estimated from clinical data of multiple subjects. The average contemporaneous and temporal networks are taken as the data-generating network structure for each individual in this study. Edges across networks are scaled to a maximum of 0.69, therewith edges between networks can be visually compared to one another. (a) Synthetic-data network, (b) Case-data network, and (c) Geschwind-data network. edge weights of the standard deviation networks for both contemporaneous and temporal networks were standardized.
To our knowledge, for GIMME, there is no cutoff or rule of thumb that has been proposed in the literature to determine whether a sample is homogeneous. We suggest taking the percentage of homogeneous edges as an indication of the amount of homogeneity. GIMME provides group-level output in which grouplevel effects are indicated by black edges, and individual effects are indicted with grey edges (Lane & Gates, 2017). Group level edges are seen as homogeneous, whereas individual level edges are seen as heterogeneous (Beltz & Gates, 2017). To assess the amount of estimated homogeneity, we propose to take the percentage of estimated group level edges, i.e., the number of group level edges divided by the total number of edges estimated in the group level network: 11 % of homogeneous edges ¼ # of group level edges # of group level edges þ # of individual level edges Â 100 Data was generated based on the three previously described network structures using the graphicalVARsim function from the graphicalVAR package in R (Epskamp, 2020a) which simulates data from a graphical VAR model. Network models were estimated from the simulated data using the R packages: graphivalVAR (Epskamp, 2020a), mlVAR , and GIMME (Lane et al., 2020). The simulation study was performed in R, version 4.0.2 (R Core Team, 2015). R code for the simulation set-up is included as supplementary materials. In addition, a sensitivity analysis was performed for graphical VAR, multilevel VAR, and GIMME to inspect the resemblance between the data generated networks to the estimated networks. Results for the sensitivity analysis can be found in Appendices B, C & D, for graphical VAR, multilevel VAR, and GIMME, respectively.

Results
Across all tools, sensitivity was often too low to make valid claims about heterogeneity (for sensitivity analysis of graphical VAR see Figure B1 in Appendix B, for mlVAR see Figure C1 in Appendix C, and for GIMME see Figure D1 in Appendix D). 12 In addition, sensitivity was dependent on network structure for the graphical VAR and the ml VAR method, as well as the estimation technique. Sensitivity was highest when the network structure was sparse (i.e., with relatively few edges) and the edges present were relatively strong as is the case for the synthetic-data network structure. When the network structure was denser, as is the case for the Geschwind-data network, or showed relatively weak edges, sensitivity started to decline. This latter effect was particularly strong when estimating temporal networks using the graphical VAR method, and when estimating contemporaneous networks using GIMME.
To illustrate the lack of sensitivity and its implications for the validity of claims on heterogeneity, we randomly drew three individual networks estimated using the graphical VAR method, from the 5, 000 simulated networks for the synthetic-data network condition (see Figure 3 panel a and c to visually compare the three networks). It is important to note that these networks were chosen to convey how the visual comparison of individual networks can go wrong, especially when sensitivity is low. For t ¼ 50 the individual contemporaneous and temporal networks did not differ much with respect to overall edge weight (Average edge weight and standard deviation for the estimated contemporaneous network model of individual 1: M ¼ 0:11ðSD ¼ 0:09Þ, individual 2: M ¼ 0:15ðSD ¼ 0:11Þ, and individual 3: M ¼ 0:14ðSD ¼ 0:12Þ; Temporal network of individual 1: M ¼ 0:13ðSD ¼ 0:09Þ, individual 2: M ¼ 0ðSD ¼ 0Þ, and individual 3: M ¼ 0:16ðSD ¼ 0:14Þ). However, individual networks differed with respect to network density (network density for estimated contemporaneous networks: 36%, 14%, 46% and temporal network: 44%, 0%, 32%, of individuals 1, 2, and 3 respectively.), and in detected edges. Few similarities could be found in terms of detected edges; in addition, the strength of these edges varied across the three exemplary individual networks.
Perceived heterogeneity vanished when visually inspecting the estimated network structures using graphical VAR when t ¼ 400, see panels (b) and (d) in Figure 3. When visually inspecting the individual 11 It is important to note that individual effects can be both contemporaneous and temporal. If an individual effect was estimated both as contemporaneous for some individuals and temporal for others, this was taken to mean there is some level of heterogeneity, and both edges added up to the total number of individual edges. 12 Sensitivity analysis for GIMME was infeasible for the original three data generating network structures as the data generating model for a GVAR model does not directly correspond to one SVAR model (the modeling framework used by GIMME to estimate networks). However, one SVAR does directly correspond to a unique GVAR model. Therefore, a SVAR data generating network is added to the simulation study, see Supplement B for further details on the data generating network structure and the full simulation study results. contemporaneous networks, the networks showed significant resemblance. Note that sensitivity is high for the contemporaneous networks; all seven data generating edges were estimated for each of the individual networks. However, in terms of network density, we still found notable differences. For the contemporaneous networks density differed (32%, 43%, 29%, for t ¼ 400 for individuals 1, 2, and 3 respectively), as networks A and B showed a few weak erroneous edges. These edges can be disregarded in terms of edge weight in comparison to the other edges present. In terms of overall edge weight we found slight differences (average edge weight and standard deviation for the estimated contemporaneous network model of Although sensitivity for temporal networks as estimated with graphical VAR was high at t ¼ 400, the exemplary temporal networks of individual 1, 2, and 3 showed a larger range of network density than for their contemporaneous networks (network density of 63%, 57%, 37%, for the estimated temporal networks of individual 1, 2 and 3 respectively). In terms of overall absolute edge weight, we found small differences (temporal network individual 1: M ¼ 0:15ðSD ¼ 0:13Þ, individual 2: M ¼ 0:14ðSD ¼ 0:13Þ, individual 3: M ¼ 0:20ðSD ¼ 0:13Þ). Heterogeneity in terms of estimated edges, network density, and edge weights was more pronounced for the case-data and the Geschwind-data network structures. Examples of three networks estimated under the case-data and the Geschwind-data network structure can be found in Figures A1 and A2 in Appendix A.
Moving forward, we investigated the use of centrality measures as an indication of individual differences between networks. For the entire sample, based on each individual network as estimated with the graphical VAR method, node centrality strength was computed. For contemporaneous networks, we computed one measure: strength (the sum of all absolute edge weights connected to the node), for temporal networks, this measure is divided in in-strength (the sum of all absolute incoming edge-weights) and outstrength (the sum of all absolute outgoing edge weights). For all possible pairs of individual networks within the sample, the strength estimates were correlated in order to give an indication of the resemblance in strength centrality. Figure 4 shows the distributions of the correlation of these strength measures for the entire sample. Regardless of network structure, strength measures correlated more with one another for contemporaneous and temporal networks as the number of time points increased. Furthermore, strength measures showed a higher correlation for contemporaneous networks than for temporal networks. For all three network structures, the correlation of strength measures for the contemporaneous network are somewhat similar from roughly t ¼ 200 onward.
The median correlation for in-and out-strength for the Geschwind-data network remained low in all cases, in-strength: r ¼ 0:23, out-strength: r ¼ À0:002: For the case-data network the median correlation for in-strength and out-strength increased from r ¼ À0:13 and r ¼ À0:009, to r ¼ 0.86 and r ¼ 0.94 respectively. For the synthetic-data network the median correlation for in-and out-strength increased from r ¼ 5.14 and r ¼ 0.15 to r ¼ 0.67 and r ¼ 0.63. Surprisingly, in-and out-strength correlation increased most for the case-data network as opposed to the synthetic-data network.
It is important to note here that the correlation between node strengths, although regularly used in the literature, is a problematic measure of similarity Forbes et al., 2017a). This is because, in general, weak correlations may be the result of low variance in estimated edge weights (i.e., a situation in which all edge weights are approximately the same). If the variance in individual centrality measures is low this can lead to very weak correlations (around zero) between two individual centrality measures, even when measures are similar. This means a high correlation could be an indication of similarity in strength centrality, while a weak correlation may also be an indication of similarity-making the interpretation of correlations asymmetric and less intuitive. In this particular case, however, weak correlations did not result because of a lack of variance, but rather because of a lack of a linear relation between individuals' centrality measures. Therefore, here we may safely interpret weak correlations as an indication of a lack of resemblance.
For mlVAR estimation results, we inspected the standard deviation of random effects for the contemporaneous and temporal effects. We took a cutoff of one standard deviation for the edge weights to determine whether population heterogeneity was present. In order to compare the edge weights of the different network structures and their standard deviations, edge weights were standardized. Standardized density distributions of the standard deviation of random effects for the contemporaneous and temporal effects can be found in Figure 5.
For contemporaneous network structures, the estimation of large heterogeneous effects was limited with relatively small sample sizes n ¼ 50 and t ¼ 50. For each of the three data generating network structures, no heterogeneous edges were detected when the number of individuals increased to n ¼ 200 and the number of time points per subject to t ¼ 400. For all temporal network structures with n ¼ 50, the estimation of heterogeneous edges was limited. Fewer differences were detected for the synthetic-data network structure and the case-data network structure than under a Geschwind-data network structure. When increasing time points to t ¼ 200 per individual, even less heterogeneity was detected.
In contrast to mlVAR, to the best of our knowledge, for GIMME there has been no cutoff recommended in the literature. We decided to inspect the percentage of homogeneous edges and take this as an indication of the amount of heterogeneity present, see Figure 6. The percentage of homogeneous edges was 7%ðSD ¼ 0:33%Þ, 9%ðSD ¼ 0:33%Þ, 10%ðSD ¼ 0:64%Þ for n ¼ 50 and t ¼ 50 for the synthetic-data, case-data and Geschwind-data network structures respectively. This percentage increased as the number of time points increased. When t ¼ 400 the median percentage of homogeneous edges was 97%ðSD ¼ 4:45%Þ, 64%ðSD ¼ 5:48%Þ, 64%ðSD ¼ 8:63%Þ for synthetic-data, case-data and Geschwind-data network structures. Interestingly, we found that the percentage of homogeneous edges detected decreased as the number of participants increased. The same pattern was found for a larger synthetic-data network structure, see Figure S10 in supplementary materials B.

Discussion
The present paper aimed to expose effects of sampling variation and power limitations in the investigation of heterogeneity in idiographic network models. We have shown that, even if the underlying network model is invariant across individuals, limitations regarding specificity and sensitivity can impose a great deal of variety in individual network structures estimated from the data, especially when the number of time points is small. In a simulation study, we evaluated four different network tools for assessing heterogeneity: visual inspection, comparing centrality measures, inspecting random effect standard deviations, and applications of GIMME. Results showed that low statistical power places considerable limits on the validity of conclusions regarding heterogeneity for all tools. At low sample sizes, an applied researcher is likely to erroneously conclude that a sample is heterogeneous even if it is in fact, homogeneous. Under the right circumstances (high sample size and favorable generating network structures), inspecting the standard deviation of random effect using multilevel VAR modeling and GIMME proved most suited for detecting heterogeneity, but these circumstances are currently not realized in the great majority of designs in many relevant fields, such as psychopathology research.

Many lower-powered analyses versus one highpowered analysis
When comparing idiographic network models, an important pitfall lies in the tendency to interpret all variability in resulting networks as heterogeneity between people, while some of this variability is the result of fluctuations in the data created by noisee.g., sampling variation. Combining sampling variation with overall conservative estimation procedures (high specificity and lower sensitivity at low sample sizes) will inevitably lead to a mismatch in the presence and absence of estimated edges in addition to varying edge weights across individuals. Of course, the most straightforward way to combat these problems is to collect more data in an effort to increase statistical power without inflating the Type I error (i.e., increasing sensitivity while keeping specificity high).
Importantly, the number of time points needed to estimate robust idiographic network models is dependent on the network structure (e.g., a more sparse or dense network structure) and the estimation technique used. However, we found that even when we simulated under ideal conditions-a sparse network structure with strong edges-at least 200 time points per individual may be needed to obtain network structures that are sufficiently robust to support comparisons between individuals. The amount of time series data needed quickly adds up if the network structure is less than optimal, such as when one is dealing with a sparse or dense structure with weak edges. In these cases, the number of time points one needs per individual may exceed 400. This amount of data is often not realized in current research practice typical applications in, e.g., psychopathology research. Although some cases have been known to feature sufficiently intense measurement schemes in this respect Figure 6. Percentage of homogeneous edges as estimated through GIMME for the synthetic-data, case-data, and Geschwind-data network structures. , one rarely encounters time series of this length in, for instance, ESM designs.
In practice, we may see only about 60 observations per individual (e.g., Beck & Jackson, 2020;Jongeneel et al., 2020;Rodriguez et al., 2022), which might seem like a large amount of data given the difficulty of collecting it, but this may not be enough to estimate a robust individual network. If longer time series cannot be obtained per individual, then the only other solution is to step down from fully idiographic research (estimate one model per person) and instead use an analysis strategy that borrows information from other participants (e.g., multi-level modeling or GIMME). This may then lead to performing one well-powered analysis rather than a sequence of lower-powered analyses at the idiographic level. Such methods performed best in our simulations as well. However, both methods still showed a great deal of illusionary heterogeneity (few common edges in GIMME and relative large random effect sizes in multi-level VAR modeling) and come with their own disadvantages, such as the assumption of heterogeneity, or in the case of multilevel modeling, pulling individual estimations toward the group mean (Bringmann et al., 2013).

Comparing centrality across individuals
There is an ongoing debate on the use and interpretation of centrality measures in network analysis (Bringmann et al., 2019;Dablander & Hinne, 2018;Hallquist et al., 2021). Our results add to the discussion by showing that if one wants to use centrality measures as a way to determine heterogeneity, one should proceed with caution. Here we share four considerations to illustrate this. First, as we showed in our simulation study, it is possible to find negative correlations between estimated strength measures for two generated networks with the same true underlying network structure when the number of observations is less than t ¼ 100 per individual. Second, although adding data leads to a stronger correlation of strength centrality measures between individuals in most settings, we showed that it is possible that the correlation between centrality coefficients is low even with a large number of data points per individual (t > 200). Third, while weak correlations between strength centrality can be the result of a lack of overlap between two networks, this need not be the case because weak correlations between strength centrality can also be the result of low variance. While strong correlations between centrality measures are an indication of similarities between strength centrality, weak correlations do not necessarily have to be an indication of dissimilarity in strength centrality, making the interpretation of correlations between strength centrality counterintuitive; this warrants even more caution in the interpretation of these correlations. Fourth and last, the supposed difference in centrality measures could be due to sampling variation; hence, in the absence of systematic ways to statistically assess the correlation between centrality measures, it does not provide any way to differentiate between real heterogeneity and random fluctuations in the data.
Importantly, this does not mean that centrality coefficients cannot be used to study network topology or individual differences therein. Instead, it means that, just like with any other statistic, inspecting centrality estimates must always be assisted by assessments of precision and robustness (e.g., confidence or credibility intervals and other functions of sampling distributions). Such assessments have now become standard in cross-sectional network analysis (Epskamp et al., 2018b). Future research should investigate extending such methods for application to time-series data. Until suitable measures of precision have become available, the use of visual inspection of differences in centrality measures to detect heterogeneity in network analysis must be considered questionable.

Network heterogeneity and replicability
The discussion of heterogeneity in idiographic network models both resembles and mirrors the discussion on network replicability in cross-sectional network studies Forbes et al., 2017a). In cross-sectional network analysis, it has been recognized that there may be a great deal of sampling variation in the estimated network structures, which, combined with conservative estimation methods, may by itself lead to differences in estimated network models (Fried et al., 2021). In addition, it has been recognized that fluctuations in centrality measures may be due to sampling variation even at high sample sizes, depending on the generating network structure and the centrality measure chosen (Epskamp et al., 2018b). To this end, data-driven sampling methods or more sophisticated statistical methods are routinely used together with presented results to assess accuracy and stability in a sample as well as to statistically compare network structures of different samples (Epskamp et al., 2018b;Van Borkulo et al., 2017;Williams et al., 2020), as presumably, both sampling variation and heterogeneity lead to differences in estimated network structures.
In the literature on cross-sectional network models, differences between estimated network structures are sometimes highlighted as evidence for limited replicability of network models (Forbes et al., 2017a;2017b). However, these differences should be expected to arise partly due to the expected replicability of a method given its sensitivity and specificity (Williams et al., 2020). When studying genuine heterogeneity over different samples used for network analysis, assessments of heterogeneity cannot be assessed without taking sampling variation into account (Isvoranu et al., 2020). This also holds true for network models (or any other statistical model) estimated from time-series data. Of note, even though we simulated 400 data points as our largest sample size, such a sample size is not deemed very large when estimating a cross-sectional network of, for example, 10 nodes (featuring 45 parameters). Time-series data features auto-correlations due to temporal ordering of the data, which both require more parameters to be estimated (145 in total for a graphical VAR model) and leads to effectively less information per observation. Both factors lead to a lower expected "replicability" of the network structures, which is highlighted in our simulation results.

Limitations and future directions
At least two limitations of the current simulations need to be addressed here. First, although GIMME performed well with sufficient amounts of data, there is reason to believe the current simulation setup might put GIMME at a disadvantage. In our simulation study, we generated data based on graphicalVAR models, GIMME, however, operates under a structural VAR (SVAR) model. Running simulation conditions using a SVAR model underlying the data yielded similar results as for the GVAR models. However, our simulation study highlighted the influence of the network structure on the results. As GIMME was evaluated under just one SVAR structure, generalizations must be made cautiously. Thus, the present findings speak to the effectiveness of using GIMME when assessing heterogeneity in data that arise from a given VAR model as specified in this paper but have less bearing on the general quality of GIMME as a statistical procedure operating within its own data space. Second, although the most common way to estimate individual networks is by using some type of VAR model, it should be noted that many more estimation procedures exist, for example, based on other types of autoregressive (AR) models than the VAR model, such as AR Moving Average (ARMA) or integrated ARMA (ARIMA) (Box et al., 1970; for an overview of these techniques see: Hamaker and Dolan (2009)). However, we found that most of the network model papers focusing on the detection of heterogeneity made use of the techniques evaluated in this paper.
We conclude by highlighting future avenues for the comparison of individual networks regarding heterogeneity. The current paper has shown that several techniques used to compare individual network models are of limited usefulness in disentangling true heterogeneity from noise. Therefore, there is a pressing need for the development of tools that can do so effectively. A first possible route for future research is to rely more on hierarchical or multi-level models in which heterogeneity across individuals can explicitly be included in the model as random effects and tested to feature non-zero variance. Currently, the mlVAR package, which can be used for multi-level graphical VAR estimation, does not allow for this comparison. An alternative is to use the Dynamical Structural Equation Modeling (DSEM) module in MPlus (Schultzberg & Muth en, 2018), which allows for estimating multi-level VAR models, but with marginal correlations rather than partial correlations between innovation terms. This will allow for assessing random effects on temporal structures but not on contemporaneous partial correlation networks. A second option is to extend data-driven methods used to investigate differences in cross-sectional networks to results from time-series models . A problematic aspect here, however, is that resampling techniques cannot readily be applied as the temporal ordering of the data plays an important role in the analysis as well. Finally, promising Bayesian methods have been developed to assess not only evidence for differences but also evidence for the similarity between different samples (Williams et al., 2020). Such methods could perhaps also be expanded for time-series analysis.

Conclusion
In the quest to address the individual within psychology, some concerns arising from the use of these individual models are in danger of being overlooked, including common statistical problems that arise due to sampling variation. Although research within the field of individual network analysis is promising, it is vital to recognize the core principle of statistical analysis: accounting for the uncertainty that arises as a result of sampling variation. Sampling variation alone can lead to striking differences (illusory heterogeneity) in estimated models from different subjects, even if these subjects are identical (fully homogeneous). This article aimed to function as a wake-up call, addressing some of the concerns regarding analyzing the individual and calls for caution in the use and interpretation of these new time series techniques when making claims about heterogeneity. Figure A1. Output from graphical VAR. Three individual networks (contemporaneous and temporal) were generated under the same case-data network structure. Panel (a) shows the contemporaneous networks for t ¼ 50, (b) for t ¼ 400, (c) their corresponding temporal networks with t ¼ 50, and (d) for t ¼ 400. (a) Contemporaneous network for individual 1, 2, and 3 when t ¼ 50, (b) Contemporaneous network for individual 1, 2, and 3 when t ¼ 400, (c) Temporal network for individual 1, 2, and 3 when t ¼ 50, and (d) Temporal network for individual 1, 2, and 3 when t ¼ 400. Figure A2. Output from Graphical VAR. Three individual networks (contemporaneous and temporal) were generated under the same Geschwind-data network structure. Panel (a) shows the contemporaneous networks for t ¼ 50, (b) for t ¼ 400, (c) their corresponding temporal networks with t ¼ 50, and (d) for t ¼ 400. (a) Contemporaneous network for individual 1, 2, and 3 when t ¼ 50, (b) Contemporaneous network for individual 1, 2, and 3 when t ¼ 400, (c) Temporal network for individual 1, 2, and 3 when t ¼ 50, and (d) Temporal network for individual 1, 2, and 3 when t ¼ 400.