lavaan.survey : An R Package for Complex Survey Analysis of Structural Equation Models

Introduces the R package lavaan.survey , a user-friendly interface to design-based complex survey analysis of structural equation models (SEM). By leveraging existing code in the lavaan and survey packages, the lavaan.survey package allows for SEM analyses of stratiﬁed, clustered, and weighted data, as well as multiply imputed complex survey data. lavaan.survey provides several features such as SEM with replicate weights, a variety of re-sampling techniques for complex samples, and ﬁnite population corrections, features that should prove useful for SEM practitioners faced with the common situation of a sample that is not i.i


Introduction
Structural equation modeling (SEM) is a popular framework for formulating, fitting, and testing an abundant variety of models for continuous interval-level data in a wide range of fields.Special cases of SEM include factor analysis, (multivariate) linear regression, path analysis, random growth curve and other longitudinal models, errors-in-variables models, and mediation analysis (Bollen 1989;Kline 2011).The main development of SEM has been in social science fields such as psychology (Ullman and Bentler 2003), education (Kaplan 2008), and sociology (Duncan 1975;Saris and Stronkhorst 1984), while more recently SEM is finding applications in other fields such as ecology and biology (Grace 2006) and neuroscience (Mclntosh and Gonzalez-Lima 1994;Roelstraete and Rosseel 2011).
While classical SEM theory assumes independently and identically distributed (iid ) observations (Bollen 1989), applications often analyze data from complex surveys that may involve stratification, clustering, and unequal selection probabilities, violating this assumption (Skin-ner, Holt, and Smith 1989).For example, Marsh and Hau (2004) explained the relations between academic self-concepts and achievements in a 26-country complex multistage survey.Outside of the realm of complex surveys clustering may also occur, for instance in Byrnes et al. (2011)'s analysis of the effect of storms on kelp forest food webs, where variables such as kelp density and species richness are likely correlated across sites that are geographically close to each other.It is well-known that under complex sampling, both point and variance estimators derived under iid assumptions may produce biased and inconsistent estimates (Cochran 1977;Skinner et al. 1989).This finding was reproduced for SEM parameter estimates by Kaplan and Ferguson (1999) and Asparouhov and Muthén (2005).
Adjustments to point and variance estimators for SEM under complex sampling were discussed by Muthén and Satorra (1995) and Stapleton (2006), and estimation using pseudomaximum likelihood procedures by Asparouhov (2005Asparouhov ( , 2006) ) and Asparouhov and Muthén (2005).These procedures have since been implemented in the standard closed-source commercial software for SEM: LISREL (Jöreskog and Sörbom 2006), Mplus (Muthén and Muthén 2012), EQS (Bentler 2008), andStata (StataCorp 2011;Press 2011).Another popular commercial program, AMOS (Arbuckle 2006), does not implement complex sampling estimation at the date of writing.None of the open-source SEM packages, sem (Fox 2006;Fox, Nie, and Byrnes 2012), openMx (Boker et al. 2011), and lavaan (Rosseel 2012), directly implement complex survey adjustments.These packages do provide enough flexibility to allow for such adjustments through resampling methods if the user is willing to program these (the sem manual provides some guidance to this effect).More user-friendly interfaces are currently not available.Furthermore, with the exception of Stata, the commercial packages that do implement estimation procedures for complex sampling still omit features dealing with several complications that may arise in the analysis of complex surveys: • Some secondary data sources such as the OECD's Programme for International Student Assessment (PISA) do not provide the sampling design variables directly, but instead provide a set of so-called "replicate weights" (OECD 2009).In principle this represents a considerable simplification of highly complex survey analysis (Brick, Morganstein, and Valliant 2000).Currently, however, no SEM software allows for adjustments of SEM estimators using replicate weights; • More generally, variance estimation of SEM parameters with complex sampling using resampling methods such as the jackknife and bootstrap are not implemented directly but require additional programming on the part of the user (see Stapleton 2008, for a discussion of these methods in the context of SEM); • SEM is primarily an analytic method, so that finite population corrections may not usually be relevant (e.g., Fuller 2009, p. 342).However, SEM is also a flexible method of reformulating several descriptive methods for which the finite population may be of interest, such as domain mean and model-based small area estimation.Currently finite population corrections, which may be relevant for these purposes, are not available in SEM software.
The purpose of this article is to introduce the lavaan.surveypackage for the R environment (R Core Team 2012), which serves to bring user-friendly complex survey SEM analysis to the open source SEM implementation lavaan.In addition, by leveraging the many features of the survey package (Lumley 2004(Lumley , 2010(Lumley , 2012) it provides users with the above features currently omitted from commercially available SEM software.Thanks to code reuse and the flexibility of the survey and lavaan packages, the lavaan.surveypackage is able to provide an extremely flexible, user-friendly, and open source framework for design-based analysis of complex survey data using SEM.It also allows for the analysis of multiply imputed complex survey data (Little and Rubin 1987;Graham and Hofer 2000).The package is available from the Comprehensive R Archive Network at http://CRAN.R-project.org/package=lavaan.survey.
Section 2 discusses the theory of structural equation modeling in general and SEM under complex sampling in particular.After a brief overview of the package in section 3, sections 4.1, 4.2, and 4.3 demonstrate the usage of the package by applying it to SEM analyses arising from the literature.

Technical explanation
Different methods have been suggested to deal with complex sampling in SEM.In this article we will only deal with "aggregate" design-based methods (Skinner et al. 1989, p. 8).Three aggregate design-based point estimators have been suggested in the literature: adjustment of the weights or sample size to an effective sample size (Stapleton 2002), pseudo-maximum likelihood (Muthén and Satorra 1995;Asparouhov 2005Asparouhov , 2006)), and weighted least squares estimation (Skinner et al. 1989, p. 86); see Stapleton (2006) for an overview of these approaches.For these point estimators, different variance estimation methods are possible, including linearization (Skinner et al. 1989, p. 83) and a range of resampling methods (Stapleton 2008).This article and the lavaan.surveypackage adopt a framework due to Muthén and Satorra (1995) that encompasses pseudo maximum likelihood (PML) or weighted ("generalized") least squares (WLS) point estimation, and variance estimation by linearization or resampling.The option of which combination of methods to employ is left to the user, the default being PML, the de facto standard for SEM at the time of writing (Asparouhov 2005).
The framework adopted here starts from the observation (Skinner et al. 1989, p. 78) that the problem of the estimation of SEM parameters under complex sampling can be simplified to the usual problem of estimation of means under complex sampling through a classical three-step device (e.g., Fuller 1987, appendix 4.B).The current discussion of this remarkable observation is necessarily more condensed than that found in the comprehensive discussion by Muthén and Satorra (1995), but, following the design principle of lavaan.survey,also slightly more general in that it allows one to take into account all complex survey design aspects allowed for in the survey package.I focus on explaining the three steps which comprise the basic logic behind complex survey analysis of SEM models followed by lavaan.survey: 1.All SEM parameter estimates are implicit nonlinear functions of the vector of variances and covariances or, more generally, moments of the observed variables; 2. The moments themselves are linear estimators of the mean vector of a redefined vector of variables d; 3. Therefore, after fitting a structural equation model using the estimation method of choice, the usual theory for variance estimation of means under complex sampling can be applied to the (co)variances and projected back into the parameter space.
This simple logic produces an incredibly flexible framework for SEM estimation incorporating sampling weights, stratification, and clustering, but also resampling methods and multiple imputation.

Structural equation models
Given a p-vector of observed variables y, let Σ denote its population covariance matrix, and S n a sample estimator such that E π (S n ) = Σ, where E π denotes expectation under the sampling design.A structural equation model (SEM) is a covariance structure model Σ = Σ(θ) expressing the population covariances Σ as a function of a parameter vector θ, an often used parameterization of SEM being the "LISREL all-y" model also used by lavaan, where η is a vector of latent variables, or, equivalently, random effects, and ζ and are vectors of (latent) residuals.This model implies the covariance structure where Φ := Var(ζ) and Ψ := Var( ).The model encompasses well-known methods such as factor analysis (B = I), random effects modeling (B = I, Λ = 1 p , Ψ is diagonal and dg(Ψ) = ψ), and path analysis (Λ = I, Ψ = 0), as well as any combinations that might be identified.Typically the model parameters are not identifiable without further restrictions; indeed it is customary to impose more restrictions than necessary for identification, allowing for a test of these restrictions.In that case the model degrees of freedom is usually taken to equal df = p * − q, where q is the number of free parameters and p * = p(p + 1)/2, the number of unique (co)variances.For clarity the mean structure is ignored, though the present treatment is easily extended to means and other moments (Satorra 1992).
SEM parameter estimates θn are obtained by minimizing a fitting function F (s n , σ(θ)), where s n := vech(S n ), σ := vech(Σ), and the vech operator denotes columnwise stacking of the non-redundant moments (Magnus and Neudecker 2007).The most common choice for F is the maximum likelihood (ML) fitting function, It is straightforward to show (e.g., Bollen 1989, ch. 4 appendix) that minimizing F ML maximizes the likelihood of the data under multivariate normality.As the sample size increases, the F ML becomes asymptotically equal to the weighted ("generalized") least squares (WLS) fitting function (Browne 1984), where V n consistently estimates a symmetric estimation weight matrix V .For the asymptotic equality to normal-theory ML to hold, the estimation weights V n are chosen as the inverse of the normal-theory sampling variance of s n (Fuller 1987, appendix 4.B), where D is the duplication matrix (Magnus and Neudecker 2007).When the data do not follow a multivariate normal distribution, both F ML and F WLS still provide consistent estimates.The asymptotically optimal estimator is obtained by replacing V with a consistent estimate of Var(s n ) −1 , a method sometimes called "asymptotic distribution free" (ADF) estimation (Browne 1984;Satorra 1989).In spite of its asymptotic optimality, the ADF estimator has performed very badly in simulation studies of its finite sampling behavior (Hu, Bentler, and Kano 1992;Bentler and Yuan 1999;Chou, Bentler, and Satorra 2011).
The choice of fitting function and estimation weight matrix V n thus determine the precise form of the estimator.Regardless of this choice, unless the model is just-identified, θn (s n ) is neither a linear estimator nor an explicit function of s n .However, θn (s n ) is the solution to the equation ∂F [s n , σ(θ)]/∂θ = 0, so that under mild regularity conditions (Satorra 1989), θn (s n ) can be viewed an implicit function of s n .

Estimation of (co)variances under complex sampling
Since the θn are determined entirely by the s n , it follows that the complex sampling properties of SEM parameter estimates depend on those of the variances and covariances.These can be easily studied by redefining them as a linear estimator.Suppose a complex sample is obtained by sampling, not necessarily with equal probability, primary sampling units (PSU's) within strata, after which second and third stages are sampled.For instance, in the British sample of the European Social Survey round four, 232 Postcode sectors (PSU's) were sampled within strata, 20 delivery points (2SU's) sampled within Postcode sectors, and for each delivery point, one person aged 15 or over was sampled (3SU's).
Let x denote a design-consistent estimator of E π (x).The estimator x possibly but not necessarily involves weighting.Define where y hict is the vector associated with the t-th third-stage unit of the c-th second-stage unit of the i-th PSU of stratum h, with the summation going over all the units within the i-th PSU (Satorra 1992, 260).This device essentially redefines the observed data matrix to d, simplifying the estimation of the (co)variances s to that of estimating the mean vector This simplification of the problem to that of estimating a mean implies that the usual theory of estimators for means may be applied.Assuming that Γ := VAR π (s n ) is finite, the variance estimator Γ can be obtained by "nonparametric" Taylor linearization (Skinner et al. 1989, 48), or by resampling methods including jackknife, balanced repeated replicates, bootstrap, and half-sample methods (Wolter 2007).The R implementation of these methods is described by Lumley (2004Lumley ( , 2010)).These variance estimators Γ are also consistent under nonnormality, so that all complex sampling analyses also take into account any effects of nonnormality of the observed variables.In smaller samples, non-parametric estimators of Γ may become unstable; since, under the model, Γ also estimates VAR π [s n − σ(θ)], Yuan and Bentler (1998, p. 293) suggested using residuals in a model-based adjustment to Γ that was found to stabilize the estimator in small samples; lavaan.surveyallows this optionally.
A common problem in surveys is that of missing data, either through item or unit nonresponse.A common solution under the assumption of missingness at random given covariates is to multiply impute the missing values (Little and Rubin 1987;Rubin 2004).For M imputations this yields M estimates s mn for m = 1, .., M .The point estimate is then simply the average over imputations, s.n .The variance Γ of these estimates can be estimated by (Schafer 1997) In lavaan.survey,this procedure is applied while also taking into account the complex sampling design within imputations whenever multiply imputed datasets are given as data.This is the approach taken for other analysis types in many software packages including, for instance, the survey package.It should be noted, however, that any survey weights should be included in the imputation models, and Equation (3) may not consistently estimate the variance if the response mechanism is not at random with respect to the weights (Kott 1995;Kim, Brick, Fuller, and Kalton 2006).Some care should therefore be taken with this approach when weights are involved.

SEM under complex sampling
Complex sampling impacts a structural equation model analysis in two ways: 1.The conventional estimator of the covariance matrix may be biased and inconsistent.This, in turn, causes bias in the SEM parameter estimates; 2. The sampling variance Γ of consistent estimates of the (co)variances may be affected by the design.This will affect standard errors and test(s) of model fit.
The first point suggests simply that the design-consistent estimator of the (co)variances d should be used for s n in the fitting function.This will then guarantee consistency of the estimator θ(s n ), at least for the population value θ(σ), when the model is misspecified.It can be shown that minimizing F ML with d as an estimate of s n is equivalent to the PML estimator introduced by Skinner et al. (1989, pp. 80-3).
The second point means that a design-consistent estimate Γ of the sampling variance of the (co)variances under the complex sampling scheme needs to be taken into account.Wolter (2007, eq. 6.2.2) notes that where r n is a term that converges to zero as the sample size increases (see also Lumley 2004, 3-4).When the model is identified, ∂ θn /∂s n can be obtained by invoking the implicit function theorem: since θn is the solution to ∂F [s, σ(θ)]/∂θ = 0, where ∆ := ∂F/∂σ(θ), and o θ (n 1/2 ) is a term that, under the model, converges to zero as the sample size increases (see Neudecker and Satorra 1991, for the precise form of both quantities).Thus, when the model is correct, the asymptotic variance of the parameter estimates is where E θ denotes expectation under the model.Equation 6 can be recognized as the "sandwich" estimator of variance, which is well-known in econometrics.When F ML is used and the data truly have an iid multivariate normal distribution, but also when F WLS is used and V n is chosen such that V ΓV = V , the asymptotically optimal estimator (AO) is obtained.Its asymptotic variance can then be seen from Equation 6to reduce to For F ML this corresponds to the inverse of the Fisher information.
Two strategies can now be followed for estimation of SEM parameters under complex sampling: WLS: Fit the model using weighted least squares with data s n = d, and the (Moore-Penrose) inverse Γ+ as the estimation weight matrix V n in Equation 2.1.In this case, after fitting the model, the simple form of Equation 7 can be used as a variance estimator; Robust ML (PML): Fit the model using maximum likelihood with data s n = d, and estimate the variance with Equation 6, setting V = Γ −1 NT and plugging in the designconsistent Γ matrix.
WLS estimation is asymptotically optimal and similar to the commonly employed complex sampling estimator for multiple regression (Skinner et al. 1989;Skinner and de Toledo Vieira 2007;Fuller 2009, section 6.3).However, it can lead to unstable estimates and has been found in simulation studies to have larger mean square error than the robust ML method, as well as producing test statistics that do not approach their nominal distribution (Hu et al. 1992;Bentler and Yuan 1999;Vieira and Skinner 2008;Chou et al. 2011).For this reason the robust ML method is the default in lavaan.survey,though both methods are implemented.

Goodness-of-fit testing of the restrictions
Under the null hypothesis of model correctness, the residual covariances should approach zero as the sample size increases.A chi-square statistic for a test of this hypothesis when the estimation procedure is AO is χ 2 AO (df ) = nF .A large number of other fit indices exist, all of which are derived either from this model chi-square statistic or from the residuals directly (Kline 2011).When the robust ML procedure is used, nF no longer follows a chi-square distribution and an adjustment to the test statistic is necessary.The most commonly used adjustment matches the first moment of the test statistic to that of the true distribution: where δ is the average generalized design effect (Skinner et al. 1989, pp. 43-4).This adjustment is known in the SEM literature as the "Satorra-Bentler" (SB) chi-square (Satorra and Bentler 1994).In lavaan, robust ML estimation using the mean generalized design effect adjustment is called "MLM" estimation, and this is the default used in lavaan.survey.The mean generalized design effect is given in the output as the "scaling correction factor for the Satorra-Bentler correction".If one defines n eff := n/ δ, Equation 8 provides a rationale for the effective sample size method discussed by Stapleton (2002), although the method of obtaining n eff is different, and there is no guarantee that standard errors will be adequately corrected just by replacing n by n eff in variance estimators.Other adjustments include the "meanand-variance" adjustment (Satorra 1992, p. 258) and the Satterthwaite (1941) adjustments, which adjusts the degrees of freedom in addition to the value of the test statistic itself.These options are available in lavaan.surveyby making use of their implementations in lavaan.
In small samples or with few clusters relative to the number of observed covariances to be estimated, it may happen that Γ is singular.This is not a problem in itself, as the model is typically restricted so that the parameters may still be identified from the observed moments even when some of these moments are collinear.However, for robust ML there may also be cases where the effective degrees of freedom for the goodness-of-fit test are smaller than df = p * − q, which is the default.After using robust ML with a singular Γ, lavaan.surveytherefore checks whether the degrees of freedom for the goodness-of-fit test are still valid.If they are not, a warning is issued with the advice to use the Satterthwaite adjustment of degrees of freedom.lavaan.surveyalso implements Yuan and Bentler (1998)'s model-based smoothing of Γ which may represent an alternative way of alleviating this problem; in the future we plan to add more smoothing estimators to the program.

About the package
lavaan.survey is a concise package written entirely in interpreted R code and published under the GPL 2. It links the survey and lavaan packages, thus providing an interface to a great variety of structural equation model analyses under complex sampling.Its aim is to provide sensible defaults while allowing for flexibility and to check for any estimator-specific problems where possible.In addition, it pools the mean and covariance estimates from datasets obtained by multiple imputation to allow for complex sampling analyses with missing data.The general workflow of a lavaan.surveyanalysis is: • Create lavaan fit object to specify the model; • Create survey svydesign object to specify the sampling design; • Call the lavaan.surveyfunction with the fit and design objects as arguments.
The order of the first two items does not matter.Table 1 gives an overview of the arguments taken by the function lavaan.survey.OECD 2009).Due to the high complexity of PISA's sampling design as well as for purposes of nondisclosure, the OECD does not provide the original design variables, but rather a set of 80 replicate weights generated by the closed-source program WesVarPC1 .To take the sampling design into account in SEM analyses of PISA data, these replicate weights need to be taken into account, a feature not available in any standard SEM software.This section shows how such an analysis may be performed using lavaan.survey.The 2003 PISA data are freely available from the OECD website2 ; here I follow the original authors in analyzing a subset of these data containing the Belgian sample and variables measuring students' math ability in four domains, self-concept of their math ability, self-efficacy, and school level.For the precise definitions of these variables and the indicators used please see the appendix to Ferla et al. (2009).In addition, gender and socio-economic status of the parents will be used as fixed covariates.The subset analyzed is included in the online supplement and can be loaded onto the R workspace with the command R> load("pisa.be.2003.rda")
Having defined the sampling design, the next step is to perform a conventional SEM analysis without taking this design into account.Figure 1 shows a simplified version of the model analyzed by Ferla et al. (2009) as a path diagram.The figure shows that a reciprocal effect between self-concept and self-efficacy is specified, which is identifiable due to the absence of a direct effect of school level on self-concept.Self-concept and efficacy affect math ability and are also partially mediating variables for the effect of school level on math ability.All structural relationships are controlled for gender and socio-economic status of the parents.For clarity, Figure 1 omits the indicators of the three latent variables self-concept, self-efficacy, and ability; these are assumed to be measured by five, eight, and four indicators respectively in a simple factor structure.This model can be specified in lavaan syntax as: In this syntax, =~indicates "measured by" and ~"regressed on".Means and variances are freed in the lavaan function call.For more information on the precise working and syntax of lavaan, please see Rosseel (2012).A conventional SEM analysis on the raw data is then performed: R> fit <-lavaan (model, data=pisa.be.2003, auto.var=TRUE, std.lv=TRUEBy specifying estimator="MLM", this conventional analysis uses the option of calculating nonnormality-robust standard errors and chi-square, yielding a "scaling correction" (average generalized "design" effect of nonnormality) of 1.1.This serves to make the conventional analysis more comparable to the complex sampling analysis, which can be expected to increase the scaling correction relative to the value after taking nonnormality into account.Now that a survey design object and a lavaan fit object have been obtained, the complex sampling analysis can be performed using lavaan.survey:This example call uses all the defaults, i.e., robust ML estimation without model-based smoothing; this is equivalent to pseudo-maximum likelihood (PML) estimation.The average generalized design effect taking into account both nonnormality and the sampling design is 1.45, which is 31% higher than that for the conventional analysis only taking nonnormality into account.
Table 2 gives the point and standard error estimates for the parameters of primary interest, corresponding to the black arrows in Figure 1.For comparison, both the results from the "naive" conventional SEM analysis and from the lavaan.surveyanalysis employing the replicate weights are given.Table 2 shows that the differences in point estimates are relatively small.The differences in standard error estimates, however, are considerable.The average ratio between the standard errors from the complex sampling and the conventional analysis, termed "conditional relative efficiency" (creff) in the table (Oberski 2011, chap. 3), is 1.38.
The model fits very badly, even after taking the scaling due to complex sampling into account.One method of investigating which restrictions are especially offensive is to examine the "modification indices" (also known as "score" or "Lagrange multiplier" tests) for restricted parameters.Under the null hypothesis of a correct restriction, these will follow a chi-square distribution with one degree of freedom.The plyr function arrange was used to give a concise syntax.The modification indices adjusted for complex sampling are shown as mi.scaled.The most problematic restrictions appear to be zero error correlations among items in the self-efficacy construct.This may indicate common method variance or multidimensionality of the latent self-efficacy variable.

Application 2: Confirmatory factor analysis of welfare state attitudes
Roosma, Gelissen, and van Oorschot ( 2012) discuss an analysis of citizens' attitudes toward the welfare state.They used data from the 2008 (fourth) round of the European Social Survey (ESS) to compare factor means that represented whether respondents thought the welfare state was legitimate and achieved its stated goals across countries.An additional goal of the study was the investigation of the relationship between the factors.The ESS is a multinational survey in which each country has its own sampling design -a design that can vary in complexity from simple random sampling from a population register (e.g., Denmark) to four-stage stratified cluster sampling (e.g., Turkey).This section analyzes the United Kingdom sample, focusing on two of the factors investigated by Roosma et al. (2012).
The ESS data for round four are publicly downloadable online3 .The UK subset analyzed here additionally includes information on strata and primary sampling units that is absent from the public database.The subset is included in the online appendix.
R> load(file="ess4.gb.rda") Focusing on two factors representing "range" and "outcomes goals" (Roosma et al. 2012, Table 1), a two-factor model is formulated using lavaan syntax as: R> model.cfa<-+ "range =~gvjbevn + gvhlthc + gvslvol + gvslvue + gvcldcr + gvpdlwk + goals =~sbprvpv + sbeqsoc + sbcwkfm" The "range" factor represents the opinion that government should be responsible for various outcomes associated with the welfare state and has six observed indicators.The "outcome goals" factor represents the respondent's opinion of whether these goals are actually reached, and is measured by three observed variables.Of particular interest here are the covariance between the two factors as well as the factor variances.For the precise question formulations and rationale behind these definitions of the factors, please see the ESS website and the original article respectively.The factor model can be estimated with lavaan using R> fit.cfa.This shows again that the nonnormality of the data have a considerable effect on the standard errors and chi-square test of model fit.Since the original authors were satisfied with the attained model fit and the focus is here on the estimation of the relationship between the factors, we shall ignore the issue of model fit in this application.
The U.K. sample was stratified based on 37 regions (stratval).Within each region, postcode sectors (psu) were listed in increasing order of population density and tenure; sectors with fewer than 500 delivery points were combined.In the first stage a systematic sample of 232 sectors (225 in Great Britain and 7 in Northern Ireland) was then drawn with probability proportional to postal delivery point count.The second and third stages were simple random sampling of 20 postal delivery points within the sector, and selection by Kish grid of one person aged 15 or over at the selected address.In some cases there was an intermediate stage in which a dwelling required selection from an address before a person could be selected within the dwelling.The final sampling weights (dweight) were constructed by the ESS sampling team by multiplying all selection probabilities together, normalizing to the nominal sample size, and finally trimming the weights at 4. This rather complicated design can be neatly summarized in a survey design object: R> des.gb <-svydesign(ids=~psu, strata=~stratval, weights=~dweight, + data=ess4.gb) After the definition of the sampling design, the confirmatory factor analysis taking it into account using robust maximum likelihood is again performed using lavaan.survey:The mean generalized design effect taking both nonnormality and the sampling design into account is 21% higher than that taking only nonnormality into account.
An alternative to the default robust ML estimator is weighted least squares using the (generalized) inverse of Γ as a weight matrix.This can be accomplished in lavaan.surveyby changing the estimator to "WLS".R> fit.cfa.surv.wls<-lavaan.survey(fit.cfa.ml,survey.design=des.gb,+ estimator="WLS") Since, as remarked above, this method was found unstable in a range of simulation studies and applications, a possible adjustment is to smooth the estimation weights for WLS using the model-based smoothing method suggested by Yuan and Bentler.This can be done by changing the setting for estimator.gammato "Yuan-Bentler".
R> fit.cfa.surv.wls.yb<-lavaan.survey(fit.cfa.ml,survey.design=des.gb,+ estimator="WLS", + estimator.gamma="Yuan-Bentler") To estimate the covariances used as input for the SEM analysis and their Γ matrix, it is also possible to use the various resampling methods available in the survey package: R> des.gb.rep <-as.svrepdesign(des.gb,type="JKn") In this case, the results of complex sampling CFA using the jackknife gives results very similar to the default method using Taylor linearization.
Table 3 presents point and standard error estimates, as well as a relative efficiency compared with the naive method for three parameters of interest, namely the factor variances and the covariance.Table 3 gives results using the five different methods discussed above: ML not taking the sampling design into account ("naive"), the default robust ML method using linearization to estimate Γ ("Taylor"), the robust ML method using the stratified jackknife to estimate Γ, weighted least squares using the linearized Γ−1 matrix as weights ("WLS") and the same method using the model-based smoothing estimate of Γ suggested by Yuan & Bentler ("WLS, Y-B").
Table 3 again shows that the point estimates for different versions of maximum likelihood are very similar.As the conditional relative efficiencies indicate, the standard error estimates using both Taylor linearization and the jackknife are substantially larger than those obtained under the naive method; these two methods give very similar results in all respects.Unadjusted weighted least squares estimation gives point estimates that are wildly different from those obtained by all of the other methods: most strikingly, the relationship between the factors is estimated to be positive rather than negative using this method, with z-values larger than 2 for both WLS and the other methods.However, when the Yuan-Bentler smoother is applied to the Γ matrix, point estimates are obtained that are much more similar to those obtained with ML.
Although it is possible that WLS is the only method indicating the correct direction of the relationship, cautions in the literature on this estimator would suggest that the ML or Yuan-Bentler smoothed WLS estimators are likely to be preferable.A caveat on this last estimator is that it relies on the correctness of a model for which the fit statistic indicates significant misspecification, so that the stability it introduces relative to the WLS estimator may be paid for with some amount of bias.This trade-off may work out well in some applications, however.

Application 3: Multiple imputation of dropouts in the LISS panel
The LISS panel is a web survey panel recruited by probability sampling.A random sample of households from the Dutch population register was asked to participate in the panel, and all household members were then asked to partipate.To prevent undercoverage problems, the panel organizers provided internet connections and computers to those who did not have them.For more details on the design of the LISS panel and recruitment efforts, see Scherpenzeel (2011).The LISS panel measures a wide range of variables and allows external researchers to submit proposals as well.
One question of interest is whether these questions have sufficient reliability to be of use for substantive research.Thanks to the longitudinal design, this can be investigated by the so-called "quasi-simplex" model (Alwin 2011), which Figure 2 represents as a structural equation model for the variable "internet use".The model in Figure 2 is only identified by the additional restriction var(e t ) = ϑ, i.e., equality of measurement error variances (Jöreskog 1970).Parameters of interest could then be the error variance ϑ itself, but also the reliability ratio at a time point, for example ρ 1 := ϑ/var(cs08a247).
The data for estimating the model in Figure 2 can be loaded by: R> load(file="liss.rda") This dataset contains the answers 7369 respondents gave to the question "How many hours per week, on average, do you use the internet at home?" when asked in 2008, 2009, 2010, and 2011, as well as the household identifier.The model in Figure 2  The last line defines the reliability ratio as reliab.ratio.lavaan will automatically output the point estimate for reliab.ratioas well as its standard error (using the delta method).As before, the model accounting for household clustering (nohouse_encr) can be estimated with lavaan.survey:As shown in the lavaan output, although there are 7369 respondents in the dataset, after listwise deletion only 3374 complete observations are left to estimate the reliability.Figure 3 shows that this large amount of missing data is mostly due to panel attrition (dropouts) over time.The attrition is considerable, reaching 46% in the 2011 wave.
One method of dealing with this large amount of missing data is multiple imputation.As with many missing data methods, the core assumption is that the data are missing at random, given the covariates used for imputation.In this application the covariates for a missing answer by a respondent are that respondent's answers at previous timepoints, so that this assumption is not entirely implausible.To create multiply imputed datasets, the R libraries mice, mi, and Amelia can be used, but the user can also create multiply imputed datasets with an external program such as WinBugs.This example uses the mice package (Buuren and Groothuis-Oudshoorn 2011) to impute the dropouts 100 times (not run): R> library("mice") R> liss.imp<-mice(liss, m=100, method="norm", maxit=100) The lavaan.surveypackage follows the survey package's design in employing the mitools package to analyze multiply imputed datasets.This provides full flexibility by allowing the multiply imputed datasets to come from any source.After imputation using mice, an imputationList object can be created by: R> library("mitools") R> liss.implist<-lapply(seq(liss.imp$m),function(im) complete(liss.imp,im)) R> liss.implist<-imputationList(liss.implist) The analysis can then proceed as before, using liss.implistas data; lavaan.surveywill detect that multiply imputed datasets have been given as input and pool these in the estimation of the covariance and Γ matrices.
R> Using the multiply imputed dataset, the reliability estimate for the first timepoint is slightly lower than that when using the default listwise deletion.The confidence interval, in spite of the added uncertainty due to the multiple imputations, is narrower, indicating an overall increase in information used relative to listwise deletion.

Summary
Structural equation modeling is frequently applied to samples that are not iid : lavaan.survey is designed to deal with this case.This article introduced the lavaan.surveypackage and demonstrated its usage and some of its features by application to three examples motivated by the literature.Because the package joins together the lavaan and survey packages, both very flexible implementations of respectively structural equation modeling and complex survey analysis, the number of combinations of SEM analyses and sampling designs is countless, and not all of these possibilities could be demonstrated.Instead the goal has been to demonstrate, on the one hand, the manner in which SEM analyses using lavaan might be adapted to incorporate the issue of non-iid samples, and, on the other, the application of the SEM analysis framework to common problems in complex survey analysis.By joining these two worlds in the open-source R environment, lavaan.surveyhopes to stimulate progress in the various application areas of SEM, as well as provide a flexible framework for the analysis of methodological issues in survey methodology.An important limitation of lavaan.surveyat the time of writing is that categorical data cannot be incorporated, a feature that is planned for future releases of the package.

Figure 1 :
Figure 1: Path diagram for a simplified model of children's math ability in the PISA study.Observed indicators for the three latent variables are omitted from the picture for clarity.

Table 1 :
Overview of the parameters of a lavaan.surveycall.
Ferla, Valcke, and Cai (2009) use and features of lavaan.surveybydiscussingthreeexample applications.Code and data for the examples may be obtained from http://wp.daob.org/wp-content/uploads/2013/02/lavaan.survey-examples-paper.zip.4.1.Application 1: Replicate weights analysis of math ability in PISAFerla, Valcke, and Cai (2009)present a SEM analysis of academic self-efficacy and academic self-concept's effects on children's math ability in Belgium.Particularly, they were interested in how these variables mediated effects on math ability of other variables such as the level of the school, as well as in the interrelation between efficacy and self-concept.The authors analyzed data from the OECD's 2003 Programme for International Student Assessment (PISA), a large multinational survey that employs multistage stratified sampling ( fit.surv <-lavaan.survey(lavaan.fit=fit,survey.design=des.rep)

Table 2 :
lavaan allows the user to obtain modification indices Point and standard error estimates using robust ML with and without correction for the sampling design using BRR replicate weights.with the command modificationIndices, which are adjusted to the complex sampling design after the call to lavaan.survey.

Table 3 :
Factor variances and covariance of interest for attitudes to the welfare state in the UK sample of the European Social Survey(2008).Point and standard error estimates using robust ML without correction for the sampling design, by Taylor linearization, by jackknifing, using WLS, and using WLS with the Yuan-Bentler correction.In this call to the survey function as.svrepdesign, the jackknife for stratified designs ("JKn") is specified, which is the default.The confirmatory factor analysis can then be performed on the jackknifed covariances using lavaan.survey:R> fit.cfa.surv.rep<-lavaan.survey(fit.cfa.ml,survey.design=des.gb.rep)R> fit.cfa.surv.rep can be written in lavaan syntax as: The quasi-simplex model for the evaluation of measurement error in the question "How many hours per week do you use the internet at home?".