Meta-Analysis of Generalized Additive Models in Neuroimaging Studies

Analyzing data from multiple neuroimaging studies has great potential in terms of increasing statistical power, enabling detection of effects of smaller magnitude than would be possible when analyzing each study separately and also allowing to systematically investigate between-study differences. Restrictions due to privacy or proprietary data as well as more practical concerns can make it hard to share neuroimaging datasets, such that analyzing all data in a common location might be impractical or impossible. Meta-analytic methods provide a way to overcome this issue, by combining aggregated quantities like model parameters or risk ratios. Most meta-analytic tools focus on parametric statistical models, and methods for meta-analyzing semi-parametric models like generalized additive models have not been well developed. Parametric models are often not appropriate in neuroimaging, where for instance age-brain relationships may take forms that are difficult to accurately describe using such models. In this paper we introduce meta-GAM, a method for meta-analysis of generalized additive models which does not require individual participant data, and hence is suitable for increasing statistical power while upholding privacy and other regulatory concerns. We extend previous works by enabling the analysis of multiple model terms as well as multivariate smooth functions. In addition, we show how meta-analytic p-values can be computed for smooth terms. The proposed methods are shown to perform well in simulation experiments, and are demonstrated in a real data analysis on hippocampal volume and self-reported sleep quality data from the Lifebrain consortium. We argue that application of meta-GAM is especially beneficial in lifespan neuroscience and imaging genetics. The methods are implemented in an accompanying R package metagam, which is also demonstrated.


Introduction
Combining brain imaging data across studies has great potential in terms of increasing statistical power, enabling discoveries of effects that might not be detectable in any single dataset.Due to regulatory and practical concerns, privacy in particular, it may not be possible to analyze all data in a single place.It may also sometimes be beneficial to analyze data from multiple studies in two stages, even when the data are available at a single location, e.g., when data do not fit in computer memory or runtime is nonlinear in the number of participants (Riley et al., 2010).
Meta-analytic techniques offer one way to increase statistical power without sharing raw data.By estimating the relationships under study separately in each data location, pooled estimates are obtained by combining the estimates without sharing the underlying data.With some exceptions, meta-analytic methods have been developed for combining parameters from parametric statistical models or for effect measures like relative risks (Hedges and Olkin, 1985;Sutton and Higgins, 2008).However, there are important cases in which it is impractical and suboptimal to enforce a parametric representation of the association under investigation, e.g., when an appropriate parametric model to approximate the data is not known, or its interpretability is not clear, as with high-degree polynomials (Wood, 2017).Examples include lifetime trajectories of brain development (Fjell et al., 2010), air quality measures (Gasparrini and Armstrong, 2010), and ecological phenomena (Borchers et al., 1997;Pedersen et al., 2019).Generalized additive models (GAMs) (Hastie and Tibshirani, 1986;Wood, 2017) are attractive for studying such relationships, and can easily be extended to longitudinal or other forms of clustered data via generalized additive mixed models (GAMMs), which, in addition to GAMs, can also estimate random effects.
Figure 1 illustrates modeling lifespan trajectories of hippocampal volume changes using linear mixed models (LMMs) with quadratic and cubic polynomials for the age term, and a GAMM with a smooth term for age.The LMMs were fitted using R (R Core Team, 2019) package nlme (Pinheiro et al., 2019) and the GAMM was fitted using mgcv (Wood, 2017), all with a random intercept term.The data were taken from 4,364 observations of 2,023 healthy participants (age 4-93 years, 1-8 measurements per participant) from the Center for Lifespan Changes in Brain and Cognition longitudinal studies (Walhovd et al., 2016;Fjell et al., 2017).Detailed sample characteristics are presented in the Supplementary Material.The quadratic fit is not flexible enough to capture the steep increase during adolescence -moreover, it estimates the hippocampal volume to increase until the age of around 40.The cubic fit captures the volume growth during adolescence better than the quadratic fit, but fails to capture the decline that occurs after the age of around 70.The GAMM fit, on the other hand, is flexible enough to both capture the steep increase during adolescence, a period of moderate decline during adulthood, and finally a steeper decline at older age.All figures in this paper were created using ggplot2 (Wickham, 2016).
As the methods for meta-analysis of GAMs and GAMMs are identical, we will refer to both as GAMs in the rest of this paper, unless distinction is necessary.For reasons that we will explain below, in this paper we will not discuss meta-analysis of the underlying parametric functions across GAMs.Rather, we present methods for combining GAM fits for neuroimaging data by pointwise meta-analysis of the fitted values.Although developed for use in meta-analytic neuroimaging studies, the methods can of course be applied to other types of data as well.The models under study can include any number of terms, including multivariate smooth functions.In order to employ these techniques, models should be fit independently in each cohort, with basis functions and knot placement optimal to their given dataset.We hence extend the previous work by Schwartz and Zanobetti (2000) who combined univariate fits from Gaussian additive models.Sauerbrei and Royston (2011) also proposed a somewhat related approach using fractional polynomials, although requiring individual participant data being available.The methods developed by Sauerbrei and Royston (2011) have also been used by Crippa et al. (2018) who compared strategies for dose-reponse meta-analysis for computing relative risk estimates using both fractional polynomials and cubic spline models.
The main application we have in mind is multi-center studies in which it is impractical or not possible to analyze all brain imaging data in a single location.This is for instance the case for the Enhancing Neuro Imaging Genetics Through Meta Analysis project (ENIGMA: http://enigma.ini.usc.edu/),where meta-analysis of individual site summary statistics is the commonly applied strategy (e.g., Dennis et al. (2018);van Erp et al. (2018)).The methods developed require that some model relating an outcome of interest to a set of explanatory variables has been fitted on data from each cohort, and that the model estimates can be shared across cohorts such that the expected response and their standard errors at new values of the explanatory variables can be computed.We provide a companion R package named metagam (Sørensen et al., 2020) containing functions for removing all individual participant data from GAMs fitted with the mgcv and gamm4 packages (Wood, 2017;Wood and Scheipl, 2017), such that the resulting model object only contains aggregate measures which can easily be shared.The package also contains methods for combining the fits and analyzing the results, and will be demonstrated in Section 5.The comprehensive review of meta-analysis packages in R by Polanin et al. (2017) does not mention any existing packages for conducting this type of pointwise meta-analysis, so to the best of our knowledge, metagam is the first R package to provide this functionality.
The methods presented in this paper were motivated by a project in the Lifebrain consortium (http://www.lifebrain.uio.no/)(Walhovd et al., 2018).The goal was to study the relationship between self-reported sleep and hippocampal volume across six Lifebrain cohorts, and GAMMs were a natural model choice due to the expected non-linear age-relationships for self-reported sleep parameters and hippocampal volume.In this case a safe common data store was in place, but we initially hypothesized that it might be easier to have each cohort fit a model locally and share the overall result rather than analyzing all data in a single place, leading to the development of the methods presented here.

Meta-Analysis
Consider a situation in which M cohorts m = 1, . . ., M each have a dataset D m with n m observations of a response y and p explanatory variables represented by a vector x with p elements.In the cross-sectional case, there is one observation per study participant, so the data for each cohort can be represented by D m = {(y i , x i1 , . . ., x ip ) = (y i , x i ), for i = 1, . . ., n m }.In the longitudinal case, we assume subject i in cohort m has been measured n mi times.Notably, this includes the case of individually varying numbers of assessments and time intervals betweens assessments.The data are now represented by D m = {(y ij , x ij1 , . . ., x ijp ) = (y ij , x ij ), for i = 1, . . ., n m , j = 1, . . ., n mi }.In practice, some of the explanatory variables will be time-varying, while others will be time-invariant.
Our interest concerns performing statistical inference on data from all cohorts, in the case where data cannot be shared across cohorts and analyzed jointly.When the relationship under study can be represented by a parametric model, well established methods exist for obtaining meta-analytic estimates of the model parameters.For example, if the data are cross-sectional and a generalized linear model (McCullagh and Nelder, 1989) is being used, where g(•) is some link function and µ = E(y) is the expected response, estimates βm from each cohort can be combined using either fixed or mixed effect meta-analysis (DerSimonian and Laird, 1986;Gasparrini et al., 2012).The covariance matrix V ( βm ) of βm is used to define meta-analytic weights and to compute standard errors of the combined estimates.Linear regression is a special case of (1) with identity link g(µ) = µ and observed responses deviating from the mean according to a normal distribution with standard deviation σ: y ∼ N (µ, σ 2 ).An extension of (1) to longitudinal data is straightforward, by using linear mixed models (Laird and Ware, 1982) of the form where b represents the subject-specific random effects and z is the explanatory variable vector for the random effects.The fixed effect estimates βm would be combined in exactly the same way as for linear models.
There is a variety of similar and related approaches in other modeling frameworks, such as structural equation modeling to model individual differences in change over time (e.g., Brandmaier et al. (2018); Kievit et al. (2018)) but they all share the property that they are parametric models that are grounded in some a priori theory about the change process under investigation.

Generalized Additive Models
In many applications, assuming that a linear predictor η is a smooth function of the explanatory variables, rather than following the model (1) that is linear in its parameters (e.g., polynomial), may lead to better statistical fit, cf. Figure 1.Generalized additive models (GAMs) (Hastie and Tibshirani, 1986) take this approach.For example, with a univariate smooth term for each explanatory variable, the model becomes where each function f j (x j ) is assumed to be smooth in x j .Another example is a model with a single bivariate smooth term (Wood, 2006), In general, we let X s denote the set of explanatory variables used by smooth function f s (•).Hence, in (2), X s = {x s } for s = 1, . . ., p, and in (3), X s = {x 1 , x 2 } for s = 1.We can thus write a GAM with S smooth terms on the form where β 0 denotes the intercept.As each smooth term is only uniquely determined up to some additive constant, some constraints have to be imposed in the fitting procedure.Details are discussed in Appendix A.
Each smooth function f s is typically defined as a linear combination of A linear term for x j can be obtained by setting X s = {x j } and f s (X s ) = β s x j .
Figure 2 illustrates how a smooth function is constructed from four cubic regression splines using (5).The left panel shows the splines b 1 (x), . . ., b 4 (x), and the colored dots indicate the knot location of the respective spline.The colored curves in the right panel shows the contribution from each spline to the sum in (5) when using weights (γ 1 , γ 2 , γ 3 , γ 4 ) = (1, 0.2, −0.2, 0.5), and the black curve shows the resulting smooth function f s (x).

Smoothing
With the exception of TPRS, the basis functions commonly used are characterized by knots placed along each covariate axis.In the case of TPRS, the basis functions are instead low rank approximations of thin plate splines.Simply fitting a model of the form (2) using least squares or maximum likelihood with a large number of basis functions for each smooth typically leads to overly wiggly fits.Hence, model selection or smoothing methods are required.One approach is to fit models with a successively decreasing number of knots, and comparing them using some model selection criterion, e.g., AIC (Hastie et al., 2009, Sec. 5.2.2).However, when fitting models with more than one smooth term, and possibly with multivariate terms, the number of possible combinations of knots becomes impractically high.An attractive alternative is to make sure the number of knots is set high enough to capture the nonlinearity of the association being modeled, and then add a wiggliness penalty.With the common second derivative penalty, the criterion to be minimized in the Gaussian case with identity link function is where λ s is the smoothing parameter for the sth term.An optimal value of λ = (λ 1 , . . ., λ s ) can be obtained using, e.g., generalized cross-validation, restricted maximum likelihood, or marginal maximum likelihood.This smoothing approach is used, e.g., by the R package mgcv (Wood, 2017).
Figure 3 shows the impact of the smoothing parameter λ on the lifespan hippocampal volume curves from Figure 1.The fits were computed using 50 cubic regression splines with knots equally spaced over the range of age values in the data.Using restricted maximum likelihood, the optimal smoothing parameter λ was estimated to 434.The black line represents this optimal value, which appropriately captures the nonlinearities without being too wiggly.The gray curves show a range of fits with λ between 1 and 10 7 .The curves with low smoothing follow the same overall trend as the optimal curve, but are quite wiggly, particularly during late adolescence and early adulthood.On the other hand, the curves with high λ are smooth, but do not capture the actual nonlinearity in the data.

Knot Placement
If each cohort used exactly the same basis functions, including knot placement, all basis function weights could in principle be combined by treating the  weights as linear regression parameters, as shown by Gasparrini et al. (2012).In practice, this approach is most often not feasible, as the best fitting GAMs will have varying basis function b k (•).As also noted by Crippa et al. (2018), if the range of some explanatory variable x j differs between cohorts, e.g. the age-range used in the study, the knot placement may be suboptimal at best or even lead to non-identified models.As an example, we sampled a variable x uniformly over ranges [0, 1], [0.25, 1] and [0.5, 1], computed a linear predictor matrix X using seven cubic regression splines, and then computed the eigenvalue spectrum of the cross-product matrix X X. Zero eigenvalues of the cross-product matrix indicates that the spline basis is not of full rank.Figure 4 shows the average eigenvalue spectrum over 100 random simulations.In the case x ∈ [0, 1] none of the eigenvalues were zero, and a model would be identified.However, when x was distributed over [0.25, 1], one eigenvalue was on average very close to zero and when the range was [0.5, 1], three eigenvalues were very close to zero.The practical consequence of zero eigenvalues is that some model coefficients, basis function weights in this case, would not be identified.Although imposing additional penalization like the ridge (Hoerl and Kennard, 1970) may identify a solution while also shrinking the parameters towards zero, a better solution is to let the range of each separate dataset determine its knot locations.
As a concrete example of this identifiability problem, we consider modeling of lifespan trajectories of hippocampal volumes from six European cohorts.The data are further described in Section 5.As shown in Figure 10 (top), these cohorts have widely varying age distributions.We fit GAMs relating baseline age to hippocampal volume in each cohort, but enforced the same knot location for the models in each cohort.Knots were placed at eight equally spaced quantiles of the full data sample.Table 1 shows the corresponding spline coefficients, cf.eq. ( 5).A parametric meta-analysis using the methods of Gasparrini et al. (2012) would combine each of these parameters in order to create a meta-analytic fit.However, as can be seen, cohorts Barcelona and Whitehall-II have missing values (NA) for spline coefficient γ 2 due to nonidentifiability.In addition, there are extreme outliers: Barcelona has a severely outlying value for γ 1 , BASE-II has an outlying value for γ 8 , and Whitehall-II has outlying values for γ 1 and γ 3 .This lack of identification and unstable coefficients is caused by using knot locations which, because they are forced to be equal across the cohorts, are far from the actual age distributions in these cohorts.

Pointwise Meta-Analysis of Generalized Additive Models
We now propose a model for meta-analysis of GAMs.We start by assuming that a GAM has been fitted on the data from each cohort separately, obtaining estimates of the relationships under study, where the subscript m denotes cohort.Each model term has a corresponding estimated standard deviation σs,m (X s ), and the overall fit has estimated standard deviation σm (x).Importantly, x represent some grid of values of explanatory variables over which a meta-analytic estimate of the regression function is sought, and X s is the corresponding subset of variables for smooth term s.Equation ( 7) computes the expected response over this grid for cohort m.We illustrate our methods by considering meta-analytic estimation of each single term separately, but note that inference on any combination of smooth terms, including the overall function, is readily obtained with the same methods.Some additional details related to identification of smooth terms are discussed in Appendix A. For ease of notation, we omit the dependency on X s in the rest of this section.For example, f s,m means f s,m (X s ) and σ s,m means σ s,m (X s ).
In the fixed effects meta-analysis case, the overall fit is computed as the weighted mean with standard error In the random effects case, the between-study variance σ 2 s is first estimated pointwise.The estimator σs = max was introduced by DerSimonian and Laird (1986) and used by, e.g., Sauerbrei and Royston (2011).As opposed to, e.g., restricted maximum likelihood estimation, (10) may appear computationally attractive since it does not require iteration.However, as shown by Veroniki et al. (2016), iterative methods may give more accurate estimates.Next, the overall fit is computed using with standard error Formulas ( 8)-( 12) are the familiar weighted means formulas used in metaanalysis, and have been used by Sauerbrei and Royston (2011) in a similar setting, focusing on meta-analysis of univariate functions estimated by fractional polynomials.In the fixed effects case, fs is the estimated mean conditional on randomly pooling from the populations of the observed cohorts alone.Random effects analysis, on the other hand, estimates the marginal population effect f s across all potential studies.See Viechtbauer (2010, Sec. 2.3) for an excellent discussion of the interpretation of fixed vs. random effects meta-analyses.Confidence bands with level (1 − α) are readily obtained for either estimates as fs + z α/2 se fs , fs + z 1−α/2 se fs , where z q denotes the qth quantile of the standard normal distribution.
Tests for statistical significance of smooth terms can be performed by combining the p-values from each separate fit using methods for meta-analytic combination of p-values as summarized, e.g., in Becker (1994).In particular, let p s,m denote the p-value obtained in cohort m for the hypothesis H 0,m : f s (X s ) = 0 that the smooth term s is zero over the whole range of explanatory variables X s in cohort m, and let H A,m : f s (X s ) = 0 denote the alternative hypothesis.Such p-values can be computed using the methods in Wood (2012).The meta-analytic null hypothesis then states that all p-values are uniformly distributed between 0 and 1, i.e., H 0 : p s,m ∼ U (0, 1), m = 1, . . ., M , while the meta-analytic alternative hypothesis H A states that all p-values have the same unknown non-uniform density which is non-increasing in the test statistic (Birnbaum, 1954).A large number of methods exist for computing the combined p-values.For example, Stouffer's sum of z method (Stouffer et al., 1949) uses the Z-score where Φ is the standard normal distribution and Φ −1 its quantile function, and w m , m = 1, . . ., M are meta-analytic weights.Zaykin (2011) suggests defining the weights as the square root of the sample size, In order to compute either fixed effect estimates using ( 8) and ( 9) or random effect estimates using ( 11) and ( 12), we require the availability of some method for computing predictions and standard errors for each model term.In the GAM case, this requires knowledge of the basis functions and corresponding weights estimated in each study, as well as their covariance matrix.These quantities are readily available from software for fitting GAMs, like mgcv (Wood, 2017) or pyGAM (Servn and Brummitt, 2018).Importantly, the individual participant data are not required for any of these procedures, and the basis functions need not be equal across cohorts.

Function Estimation
In order to investigate the statistical performance of the meta-analytic approach presented in the previous section, we conducted a simulation study in which data were generated from the model All explanatory variables were independently uniformly distributed in [0, 1] and ∼ N (0, σ 2 ).The functional forms assumed were similar to those used by Marra and Wood (2012), and shown in Figure 5.
Datasets with 4,000 observations of (x 0 , x 1 , x 2 , x 3 , y) were independently sampled 1,000 times.Each set of simulations was run with residual standard deviation σ = 0.5 and σ = 1.For each dataset, the following three cases were considered: • In the mega-analysis case, all 4,000 observations were analyzed jointly.
This served as a gold standard, yielding the model that would be fit if all data were available to analyze with a single model.
• In the equal sample size case, the dataset was split into 5 "cohorts" of 800 observations each.Each cohort was analyzed independently, and the meta-analytic fit computed as outlined above.
• In the unequal range and sample size case, a first "cohort" was created by sampling 300 observations with x 2 < 0.5 from the full dataset, the second cohort by sampling 500 observations with x 2 ≥ 0.5 from the remaining observations, the third cohort by sampling 800 observations with x 1 < 0.5 from the remaining observations, the fourth cohort by sampling 1,000 observations with x 1 ≥ 0.5 from the remaining observations, and the fifth cohort contained the remaining 1,400 observations.Hence, this case has the same sample sizes as the unequal sample size case, but the ranges of x 1 and x 2 vary between cohorts.
In the latter three cases, fixed effects meta-analysis was conducted.In all cases, univariate smooth terms were estimated using cubic regression splines with 20, 10, 30, and 5 basis functions for f 0 (x 0 ), f 1 (x 1 ), f 2 (x 2 ), and f 3 (x 3 ), respectively.Second derivative smoothing was performed using generalized crossvalidation, and standard error computations for each term included the uncertainty about the overall intercept as described in Marra and Wood (2012).The smooth terms were subject to a point constraint at the midpoint x 0 = x 1 = x 2 = x 3 = 0.5 to ensure that the terms were identified and comparable across cohorts, cf.Appendix A. Both the meta-analytic fits and the mega-analytic fit were shifted along the y-axis to ensure that they summed to zero over [0, 1], making them comparable to the true functional forms.All computations were performed in R version 3.6.2(R Core Team, 2019) with the package mgcv (Wood, 2017).
Figure 6 shows the average fits over all simulations.One can hypothesize that splitting a dataset into smaller parts and performing smoothing separately might lead to oversmoothing compared to analyzing all data in a single model.Considering Figure 6 we see that this is the case for estimating f 2 (x 2 ) in the case with σ = 1, in which all meta-analysis cases slightly underestimated the two peaks of the true term.In the case of σ = 0.5, however, all meta-analysis estimated had very low bias.The two meta-analyses with unequal sample size, also had somewhat too smooth estimates of f 1 (x 1 ) in the σ = 1 case.Overall, however, the average fits in the meta-analysis cases were very close to the true curves.
Table 2 shows the root-mean-square error (RMSE) of the fitted terms over the range [0, 1].In both noise settings, the meta-analyses with equal and unequal sample size had only slightly higher RMSE than the mega-analytic estimates, and there did not seem to be any systematic difference between them.The meta-analysis with unequal range and sample size had somewhat higher RMSE for f 1 (x 1 ) and f 2 (x 2 ), corresponding to the two variables whose range differed    between cohorts.For the terms f 0 (x 0 ) and f 3 (x 3 ), the unequal range and sample size case had RMSE very close to the two other meta-analytic cases.
Table 3 shows the average coverage across [0, 1] of 95 % confidence intervals computed with (13).The coverage of the confidence intervals of the megaanalytic estimates were close to 95 %, as expected from Marra and Wood (2012), and always conservative.For the meta-analysis with equal and unequal sample size, the coverage varied between 88 % and 99 %.In particular for f 1 (x 1 ) and f 2 (x 2 ) the confidence intervals were somewhat too narrow for these scenarios.The unequal range and sample size case had poorer coverage for f 1 (x 1 ) and f 2 (x 2 ), varying between 82 % and 87 %.For f 0 (x 0 ) and f 3 (x 3 ), on the other hand, all three meta-analytic cases had very similar coverage.

Hypothesis Testing and Power
As mentioned in Section 3, p-values for smooth terms combined in a metaanalytic fit can be computed using methods for meta-analysis of p-values.Two issues are of particular interest in this regard; first, whether their distribution is close to uniform when the null hypothesis is true, and second, the power to reject a false null hypothesis.Simulation experiments were conducted to investigate both issues.A nonlinear functional form was assumed as shown in Figure 7, representing the lifespan trajectory of the volume of a brain region of interest.For the power analysis, it was assumed that a dichotomous group variable interacted with the lifespan trajectory, leading to slightly higher atrophy for members of the baseline group, especially in advanced ages.For analysis of the null distribution of p-values, the two groups had identical lifespan trajectories.The shapes of the functional forms were taken from an estimate of lifespan trajectory of cerebellum cortex volume using data from Center for Lifespan Changes in Brain and Cognition longitudinal studies (Walhovd et al., 2016;Fjell et al., 2017).Analyzing this type of smooth interactions is relevant, e.g., when investigating the impact of a given genetic variation on lifespan trajectories of brain measures (Walhovd et al., 2019).
Our goal was to compare the p-value distribution under the null hypothesis and the power to detect an interaction between the smooth function and group membership for a mega-analysis and for the meta-analysis approach developed in this paper.Cross-sectional measurements were simulated with age uniformly distributed between 4 and 94 years, and group memberships randomly allocated to 0 or 1 with equal probabilities.For the mega-analysis, all measurements were analyzed in a single GAM, while for the meta-analysis, the data were first split into 6 datasets and analyzed separately, before a meta-analytic p-value was computed.For reference, the power obtained when using a single dataset of size 1/6th of the total dataset was also computed.A total of 1,000 Monte Carlo samples were analyzed for each parameter setting.For the case of a nonzero group interaction, statistical power was computed as the fraction of the 1,000 random simulations in which the group interaction was significant at a 5 % level.In the first set of simulations, the total sample size was fixed at 3,000 while the residual standard deviation varied between 1,000 and 15,000.In the second set of simulations, the residual standard deviation was fixed at 3,500, and the total sample size varied between 900 and 3,000.In all cases, "cohort fits" were computed by randomly splitting the dataset into 6 equally sized parts.The GAM used to analyze the data was of the form where x 0 ∈ {0, 1} is an indicator for group membership and x 1 is age, and is a normally distributed residual.The parameter β 0 represents the offset effect of group membership, the smooth term f (x 1 ) represents the age trajectory of subjects in group 0, and f 2 (x 1 ) represents the difference between the smooth term of subjects in group 1 and subjects in group 0. Hence, subjects in group 1 have age trajectory given by f 1 (x 1 ) + f 2 (x 1 ).GAMs were fitted with the gam function in mgcv (Wood, 2017), using cubic regression splines to construct the smooth terms and generalized cross-validation for smoothing.The null hypothesis states that there is no difference between the lifespan trajectories across groups, i.e., the p-values were calculated for H 0 : f 2 (x 1 ) = 0 in the mega-analysis and separately for H 0,m : f 2 (x 1 ) = 0, m = 1, . . ., 6 in the meta-analysis.For the mega-analysis, the p-values were directly available from the model object returned by mgcv, which uses the methods described in Wood (2012).For the meta-analysis, we compared several different methods for combining p-values: Wilkinson's maximum p (Wilkinson, 1951), Tippet's minimum p (Tippet, 1931), the logit-p method (Becker, 1994), Fisher's sum of logs (Fisher, 1925), Edgington's sum of p (Edgington, 1972), and Stouffer's sum of z (Stouffer et al., 1949), using the implementations in the R package metap (Dewey, 2019).See Loughin (2004) for an in-depth comparison of methods for combining p-values.As all samples in the meta-analysis were of equal size, equal meta-analytic weights were used in Stouffer's sum of z ( 14).The other methods do not use weights.Tippet's minimum p method gave p-values closest to a uniform under the null hypopothesis under most parameter settings, while Stouffer's sum of z method typically gave highest power.The p-values resulting from these two methods are hence shown in the results in this section, while complete results for all methods can be found in the Supplementary Material.
Figure 8 shows quantile-quantile plots of the p-values obtained by metaanalysis, mega-analysis, and a fit of a single dataset in the case of no actual interaction between the group variable and the lifespan trajectories in the case with sample size 3,000 and residual standard deviation 3,500.Results for other values of these parameters were similar, and are shown in the Supplementary Material.The gray line shows the ideal reference line.All methods yielded p-values which deviated to some degree from the uniform distribution.Metaanalytic p-values computed using Tippet's minimum p method were close to the p-values obtained either in the mega-analysis or in the single data fit.p-values computed using Stouffer's sum of z, on the other hand, were considerably further from being uniformly distributed.The lack of uniformity for the p-values of the mega-analysis can be explained by the approximate nature of the algorithm used to compute them, due to the need take into account the uncertainty of the smoothing parameter λ, cf.Wood (2017, Sec. 6.12).
Figure 9 (left) shows power curves over a range of noise values, and Figure 9 (right) shows power curves over a range of sample sizes.In both cases, the mega-analytic approach outperforms the meta-analytic approaches.Stouffer's sum of z method obtained power closest to the mega-analysis, while Tippet's minimum p method had lower power.As expected, analyzing a single dataset, representing 1/6th of the total data, gave much lower power than either of the other two approaches.
To summarize, meta-analysis using Stouffer's sum of z method had power fairly close to that of a mega-analysis, at an increased risk of falsely rejecting true null hypotheses.On the other hand, meta-analysis using Tippet's minimum p method had risk of falsely rejecting a true null hypothesis close to that of a mega-analysis, at the cost of lower power.The other methods for combining p-values were somewhere inbetween these extremes, as shown in the Supplementary Material.

Case Study
We will now illustrate the proposed methods on brain imaging data from six European cohorts analyzed by Fjell et al. (2019), using the R package metagam which implements the methods described in this paper.The datasets contained measurements of sleep quality and hippocampal volume from the Berlin Study of Aging-II (BASE-II) (Bertram et al., 2013;Gerstorf et al., 2016), the Betula project (Nilsson et al., 1997), the Cambridge Centre for Ageing and Neuroscience study (Cam-CAN) (Taylor et al., 2017), Center for Lifespan Changes in Brain and Cognition longitudinal (LCBC) studies (Walhovd et al., 2016;Fjell et al., 2017), Whitehall-II (Filippini et al., 2014), and University of Barcelona brain studies (Abellaneda-Pérez et al., 2019;Rajaram et al., 2017;Vidal-Piñeiro et al., 2014).Self-reported sleep and hippocampal volume data from 2,843 participants (18-90 years) were included.Longitudinal information on hippocampal volume was available for 1,065 participants, yielding a total of 4,621 observations.Mean interval from first to last examination was 3.8 years (range 0.2-11.0years).Participants were screened to be cognitively healthy and in general not suffer from conditions known to affect brain function, such as dementia, major stroke, multiple sclerosis, etc. Exact screening criteria were not identical across subsamples.Detailed sample characteristics are presented in the Supplementary Material.
In Fjell et al. (2019), the data were analyzed jointly using GAMMs in a mega-analysis, taking into account both the clustering of repeated measurements within the same subject, and of subjects within a given cohort.However, the methods proposed in this paper enable this type of multi-cohort analysis also when the data cannot be shared.In this particular example we examine how hippocampal volume is related to age and to sleep quality as measured by the global score on the Pittsburgh Sleep Quality Index (PSQI) (Buysse et al., 1989).A low value of the PSQI variable indicates good sleep.
The following model was first fit to data from each cohort separately: y ij denotes hippocampal volume of subject i at timepoint j, x ij,1 is the age of subject i at timepoint j, x i,2 is the global PSQI score, and x i,3 is the sex of subject i. b i ∼ N (0, σ 2 b ) is the random intercept of subject i and ij ∼ N (0, σ 2 ) is the residual.The main effect of age is represented by f 1 (x 1 ).f 2 (x 1 )x 2 is a varying-coefficient term (Hastie and Tibshirani, 1993), in which f 2 (x 1 ) is a regression coefficient for sleep quality which varies smoothly with age.The model fits were computed using the gamm function in mgcv (Wood, 2017), except for the data from Whitehall-II, which did not contain repeated measurements and were computed using gam.Restricted maximum likelihood was used for both for estimating the smoothing parameter and estimation of random effect terms, and cubic regression splines were used as basis functions.The range of the age variable differed considerably between cohorts, as shown in the top part of Figure 10.Hence, both the knot placement and the number of knots used to fit f 1 (x 1 ) and f 2 (x 1 ) was determined for each cohort separately.The knots were placed within the range of x 1 in each cohort.Using the second derivative penalty (6), the smoothing is determined by λ rather than by the number of knots, as long as the number of knots is sufficiently high to allow for a wide range of functional forms; this was tested using the simulation procedure described in Wood (2017, Ch. 5.9).The sleep quality scores were similarly distributed across cohorts, as shown in the bottom part of Figure 10.Betula differs somewhat in shape from the others, due to a transformation that had to be applied to these data (Fjell et al., 2019).
Except for additional arguments to s specifying the type of basis function and the number of knots, the following code was used to fit the model for each cohort.The argument pc = 70 defines a point constraint f 2 (70) = 0 for the varying-coefficient term, ensuring that the estimates of this term are comparable across cohorts, as described in Appendix A. Details are shown in the Supplementary Material.library(mgcv) # Fit GAMM in cohort 1 cohort_gam1 <-gamm( Hippocampus ~s(Age) + s(Age, by = PSQI_Global, pc = 70) + Sex, data = cohort_data1, random = list(ID =~1), method = "REML" ) Figure 11 shows the fits of the term β 0 + f 1 (x 1 ) in ( 15) relating age to hippocampal volume, over the range of ages in each cohort.The fitted model objects returned by the functions gam and gamm contain the original data used to fit the model, as well as the responses.The strip_rawdata function from the metagam package removes all individual participant data from each model fit, returning an object containing only aggregate quantities that can be shared  without any individual data.The following lines attach the metagam package and then creates an object cohort_fit1, which does not contain any individualspecific data.

library(metagam) cohort_fit1 <-strip_rawdata(cohort_gam1)
For each cohort, the two code chunks shown above are all that is needed to obtain an object cohort_fit1 that can safely be shared in order to obtain a meta-analytic fit.
For the meta-analysis, we will focus on the effect of age on hippocampal volume including the intercept term, β 0 + f 1 (x 1 ), and the age-dependent effect of sleep quality on hippocampal volume, f 2 (x 1 ).To this end, we set up a grid over which to compute the estimates.The grid contains the range of ages from 20 to 90 equally spaced by 0.1 year, and the value of the sleep quality score set to x 2 = 1, such that f 2 (x 1 )x 2 = f 2 (x 1 ), representing the interaction between age and sleep quality score.The following code gathers the model fits from each of the six cohorts in a list, creates a grid over which to predict, and finally uses the metagam function from the metagam package to compute the meta-analytic fits.
# Combine fits from each cohort in a list cohort_fits <-list(cohort_fit1, cohort_fit2, cohort_fit3, cohort_fit4, cohort_fit5, cohort_fit6) # Create a grid over which to compute meta-analytic fits grid <-data.frame(Age = seq(from = 20, to = 90, by = 0.1), Sex = factor("Female", levels = c("Female", "Male")), PSQI_Global = 1 ) # Smooth function of x_1, including overall intercept metafit_age <-metagam(cohort_fits, grid, terms = "s(Age)", method = "DL", intercept = TRUE) # Age-varying slope of x_2, not including overall intercept metafit_psqi <-metagam(cohort_fits, grid, terms = "s(Age):PSQI_Global", method = "DL", intercept = FALSE) The argument method = "DL" specifies that random effects meta-analysis should be used, and computed using the DerSimonian-Laird estimator (10) of DerSimonian and Laird (1986) as implemented in the metafor package (Viechtbauer, 2010).Other options for computing the between-cohort heterogeneity are also available, including maximum likelihood (method= "ML") and restricted maximum likelihood (method = "REML"), again using the implementations provided by metafor.By default, predictions from each model are computed over the whole supplied grid, thus extrapolating the estimates from cohorts whose data cover only a subset of the grid.Arguments can be specified in order to compute the predictions from each model only within the range of variables used to fit it.In practice, this latter option does not have much impact, since the standard errors are large outside of the range of the variables used in the fit, and hence the corresponding predictions get a very low weight at these points.
Figure 12 shows the meta-analytic fits compared to the full data case.The plot showing the effect of age on hippocampal volume (left) includes the overall intercept, while the plot showing the effect of global PSQI score (right) does not include overall intercept and hence represents the offset effect.The estimated effects of age on hippocampal volume are very similar between the two approaches, although the meta-analytic fit lies somewhat above the mega-analytic fit for ages below 60 and has somewhat narrower confidence bands at low ages and wider confidence bands at high ages.As in Fjell et al. (2019), there seems to be no interaction between global PSQI score and hippocampal volume at any age, as can be seen by the confidence intervals covering zero in both cases.In the meta-analytic case, the estimated curve has a peak at around 60 years, as opposed to the straight line estimated by the full data analysis.However, the confidence bands obtained with the two methods are highly overlapping.
In order to quantify how much each study contributes to the meta-analytic fit at each value of an explanatory variable, we propose using dominance plots, visualizing σ2 s,m /se 2 fs for m = 1, . . ., M .The resulting plot is shown in Figure 13 (left) for the main effect of age on hippocampal volume.The plot shows that LCBC and Cam-CAN are the main contributors to the meta-analytic fit for ages up to around 50 years, after which the relative influence of the other studies starts increasing.Furthermore, the heterogeneity of the models fit in each study can be analyzed by computing Cochran's Q statistic (Cochran, 1954) over an explanatory variable, thus comparing fs,m for m = 1, . . ., M independently at each value of the explanatory variable.Figure 13 (right) shows a heterogeneity plot comparing the main effects of age in each study, with 95 % confidence intervals represented by the shaded gray areas.From age 60 and above, there seems to be evidence of heterogeneity among the contributing studies.Using metagam, such plots are obtained with the command: plot_dominance(metafit_age) plot_heterogeneity(metafit_age) We refer to the online package documentation for more examples on how to use metagam to compute meta-analytic fits.

Discussion
We have proposed and illustrated a flexible way to obtain meta-analytic fits of generalized additive mixed models in neuroimaging studies where individual participant data cannot be shared across cohorts.In the simulation studies, the meta-analytic procedure showed estimation performance close to that obtained in the ideal case, in which all data were analyzed in a single model, except that the meta-analytic estimates tended to have somewhat too narrow confidence intervals.Furthermore, the simulations showed that when testing for an interaction between a smooth function and a categorical variable, the distribution of p-values under the null hypothesis of no interaction, and the power to detect an actual interaction, were highly dependent on the chosen method for combining p-values, offering a trade-off between power and the probability of making false rejections.The proposed method is particularly useful when the variables under study have different ranges across cohorts, such that enforcing the same knot placement is suboptimal and might lead to nonidentifiable models.This is the case in many multi-cohort and consortium studies using neuroimaging data, where for instance age-range or patient distribution across a clinical indicator may vary considerably across samples.
One particular area of application for meta-GAM is imaging genetics.The need for very large sample sizes has long been recognized (Thompson et al., 2014), which imposes challenges due to privacy and data protection as well as practical issues regarding transfer, storage and processing of large amounts of neuroimaging data.These challenges have successfully been overcome in initiatives such as ENIGMA (Bearden and Thompson, 2017;Thompson et al., 2017) using a meta-analytic approach to gene discovery.Classic meta-analytic techniques are often inappropriate in situations where genetic effects are studied in interaction with other variables, such as age in a lifespan study.To test whether effects of genetic variants on a neuroimaging outcome measure vary as a function of age, or whether the lifespan trajectories of a neuroimaging outcome variable differ as a function of genetic variation (Piers, 2018;Walhovd et al., 2019), more complex modelling is needed.This functionality is provided by meta-GAM.As shown in Figure 8, this meta-analytic approach yielded superior power to detect effects in such situations compared to single studies, although not completely reaching the same statistical power as mega-analyses (McArdle and Horn, 1985) in cases of total sample size less than 2,000.Other examples of situations where meta-GAM would be applicable are when testing whether an effect varies as a function of another continuous variable, such as blood pressure, BMI or sleep duration.In all of these cases, the neuroanatomical outcome variable is expected to show a more complex relationship to the predictor variable than what can be captured by a parametric model.In these cases, meta-analytic GAM will be a powerful strategy to test genetic effects.Thus, we believe the present strategy may be a useful tool in neuroimaging genetics.
An alternative to the pointwise meta-analysis approach presented in this paper is to treat the fitted smooth functions from each cohort as samples from a Gaussian process (Murphy, 2012, Ch. 15).A meta-analytic fit could then be obtained by using these samples to estimate the parameters of a common smoothing kernel.This approach has been taken by Salimi-Khorshidi et al. (2011) for meta-analysis of neuroimaging data.Another alternative is using multiple imputation methods to generate synthetic data in each cohort with the same distributional properties as the original data, which can then be shared and analyzed in a mega-analysis (Little, 1993;Rubin, 1993;Nowok et al., 2016).Other possible extensions include accomodating potential correlation between the pointwise estimates in a given cohort using the robust variance estimation methods developed by Hedges et al. (2010), and to model the effect of cohort-specific covariates using multivariate meta-regression (Berkey et al., 1998).Also, deriving metaanalytic weights to use when combining p-values (Rosenthal, 1978) as in Section 4.2 could potentially yield p-values closer to those of the mega-analysis.
Although we have focused on the case in which data are not available in a single location, the proposed methods can also be useful in two-stage analysis with GAMs.In two-stage analysis, models are fitted separately for each cohort as described here, and then fit using meta-analytic techniques (Burke et al., 2016).This approach seems to be somewhat less efficient than analyzing the data jointly in a one-stage model (Boedhoe et al., 2019;Kontopantelis, 2018), but is useful when combining the data is impractical due to storage requirements or harmonization challenges (Sung et al., 2014).
The methods presented in this paper are implemented in the R package metagam, available at https://lifebrain.github.io/metagam/.

Conclusion
Here we propose and demonstrate an approach to meta-analysis of neuroimaging results in situations where parametric models might not be appropriate, such as is often the case, e.g., in lifespan research.Parametric models might not be able accurately to capture lifespan trajectories of most neuroanatomical volumes, here as demonstrated for hippocampus.We show how such data can be analyzed using meta-analysis of generalized additive (mixed) models, and demonstrate that this is a powerful approach using simulated as well as real multi-cohort longitudinal data from the Lifebrain consortium.We believe this approach can be successfully applied in a range of settings where neuroimaging variables are used as outcome, especially within lifespan and neuroimaging genetics research, and beyond.The estimated intercept in cohort m 2 would now be β0,m2 = β0,m2 + ∆ β0,m2 , where ∆ β0,m2 takes into account the difference between the sum-to-zero constraint in cohort m 1 and in cohort m 2 .This argument generalizes to any number of cohorts and smooth terms.Hence, sum-to-zero constraints of the form (A.1) for each smooth can be imposed independently in each cohort fit, as long as the estimated intercept β 0 is added to each smooth term before combining.This implies replacing fs,m with β0,m + fs,m in equations ( 8) and ( 11).An important point for interpretation is that when using this option, the meta-analytic estimate of β 0 + f s incorporates both differences between estimated intercepts and differences between estimated smooth terms across cohorts.
Another way to resolve this issue is by imposing a constraint for each smooth term, specifying a point at which it should be exactly zero (Wood, 2017, Ch. 5.4.1).If the same point constraints have been applied when fitting the GAM to the data from each cohort, the smooth terms are all on the same scale and can be combined meta-analytically as described in Section 3.This approach hence replaces (A.1) by f s,m (X pc s ) = 0, m = 1, . . ., M, (A.2) for some point X pc s which is identical across cohorts.An advantage of this approach is that it does not require the intercept to be included in the metaanalysis; hence the meta-analytic estimate fs contains only the smooth term.On the other hand, point constraint may lead to wider confidence bands for the smooth terms (Wood, 2017, Ch. 5.4.1).Also, this approach requires that point constraints are specified as part of the model to be fit to the data from each cohort.For the model fit to each cohort data in Section 5, a point constraint at 70 for the interaction term was imposed by the pc argument, as shown in the code example.Note that the confidence interval for a smooth term subject to point constraint (A.2) does not need to have zero width at the constraint point X pc s .The methods for constructing confidence intervals developed by Marra and Wood (2012) based on the work by Nychka (1988), take into account the uncertainty about the overall intercept as well as the uncertainty about the smooth term, and these typically yield better coverage properties than confidence intervals which only model the uncertainty of the smooth term.

Figure 1 :
Figure 1: Modeling lifespan trajectories.Example of modeling lifespan hippocampal volume with longitudinal data using linear mixed models with quadratic and cubic terms for age, as well as a generalized additive model.The black dots show individual observations and the black lines connect subsequent observations from the same individual.The GAMM was fitted with 20 cubic regression splines and a random intercept term for each individual, and the optimal smoothing parameter estimated with restricted maximum likelihood.

Figure 2 :
Figure 2: Constructing a smooth function from splines.The left panel shows four cubic regression splines, with knot locations indicated by the colored dots.The right panel shows the contribution from each spline when using weights (γ 1 , γ 2 , γ 3 , γ 4 ) = (1, 0.2, −0.2, 0.5), and the black curve shows the resulting smooth function fs(x).

Figure 3 :
Figure 3: Smoothing parameter controls wiggliness.Impact of the smoothing parameter λ on the lifespan trajectories of hippocampal volume from Figure 1.The black line shows the fit corresponding to λ = 434, which is optimal in terms of maximizing the restricted maximum likelihood.The gray lines show a range of fits with λ between 1 and 10 7 .The annotations show the function estimates in each extreme end of the range of smoothing parameters, as well as the optimal estimate.

Figure 4 :
Figure 4: Eigenvalues are sensitive to range of explanatory variables.Eigenvalue spectrum of the cross-product matrix over 100 random simulations using knot locations as shown in the left plot, with x uniformly distributed over range shown in legend.Error bars show the sample standard deviations.

Figure 5 :
Figure 5: Functions used in simulations.True functional forms used in simulations in Section 4.1.

Figure 6 :
Figure 6: Simulation estimates overlaid on true functions.Dashed black lines show true functions.The colored lines show mean fits averaged over 1,000 simulations as described in Section 4.1.

Figure 7 :
Figure 7: Lifespan trajectories with group interaction.Functional forms assumed for lifespan trajectories in Section 4.2.Subjects are assumed to belong to either group 0 or 1, whose mean lifespan trajectories differ as shown by the two curves.

Figure 8 :
Figure8: P-value distribution under the null hypothesis.Quantile-quantile plot of pvalues under the null hypothesis as described in Section 4.2, for the case of residual standard deviation equal to 3,500 and total sample size 3,000.Meta-analytic p-values were computed using both Stouffer's and Tippet's method, as shown by the legend.

Figure 9 :
Figure9: Statistical power to detect interaction.Results of statistical power simulations desribed in Section 4.2.Left: fixed total sample size 3,000 and varying noise level.Right: fixed noise level σ=3,500 and varying total sample size.Shaded areas around curves show 95 % confidence intervals computed using the R package Hmisc(Harrell Jr et al., 2019).Metaanalytic p-values were computed using both Stouffer's and Tippet's method, as shown by the legend.

Figure 10 :
Figure 10: Empirical distribution of explanatory variables.Raincloud plots (Allen et al., 2019; Whitaker et al., 2019) showing the distribution of baseline age (top) and global PSQI score (bottom) in the data from each cohort studied in Section 5.

Figure 12 :
Figure 12: Comparison of meta-analytic and mega-analytic estimates.Meta-analytic fits obtained as described in Section 5, compared to the corresponding fit obtained with full data.Left: effect of age on hippocampal volume, including the overall intercept.Right: Interaction effect of age and PSQI global score on hippocampal volume.

Figure 13 :
Figure 13: Dominance and heterogeneity plots.Dominance and heterogeneity plots forβ 0 + f 1 (x 1 ) in equation (15).Left: the relative contribution from each study to the metaanalytic fit over age.Right: Cochran's Q statistic for heterogeneity over age.Shaded areas represent 95 % confidence intervals.
16SV5538/16SV5536K/01UW0808/01UW0706/01GL1716A/01GL1716B.Cam-CAN: Initial funding from the Biotechnology and Biological Sciences Research Council (BBSRC), followed by support from the Medical Research Council (MRC) Cognition & Brain Sciences Unit (CBU).Work on the Whitehall II Imaging Substudy was mainly funded by Lifelong Health and Wellbeing Programme Grant G1001354 from the UK Medical Research Council (Predicting MRI Abnormalities with Longitudinal Data of the Whitehall II Substudy) to Dr Ebmeier.The Wellcome Centre for Integrative Neuroimaging is supported by core funding from award 203139/Z/16/Z from the Wellcome Trust.todata in cohorts m 1 and m 2 , where the smooth term is constrained according to the data in cohort m 1 , i.e.,x∈Xs,m 1 f s,m (x) = 0, m = m 1 , m 2 .This yields estimates β0,m + fs,m for m = m 1 , m 2 , and the terms fs,m1 and fs,m2 would be directly comparable.Instead constraining f s,m2 over its own data would lead to a shift ∆ β0,m2 in the intercept estimated in cohort m 2 , i.e., x∈Xs,m 1 f s,m2 (x) = ∆ β0,m2 + x∈Xs,m 2 f s,m2 (x) = 0.

Table 1 :
Spline coefficients for models described in Section 2.4.The coefficient γ 2 was not possible to determine for Barcelona and Whitehall-II.In addition, γ 1 for Barcelona and Whitehall-II, γ 3 for Whitehall-II, and γ 8 for BASE-II are severe outliers.

Table 2 :
Mean root-mean-square error of fitted terms in the case of equal sample sizes, unequal sample sizes, and mega-analysis, with residual standard deviation σ = 0.5 or σ = 1.Standard deviations across simulations are shown in parentheses.

Table 3 :
Mean coverage of 95 % confidence intervals for fitted terms in the case of equal sample sizes, unequal sample sizes, and mega-analysis, with residual standard deviation σ = 0.5 or σ = 1.Standard deviations across simulations are shown in parentheses.