Risk-period-cohort approach for averting identification problems in longitudinal models

In epidemiology, gerontology, human development and the social sciences, age-period-cohort (APC) models are used to study the variability in trajectories of change over time. A well-known issue exists in simultaneously identifying age, period and birth cohort effects, namely that the three characteristics comprise a perfectly collinear system. That is, since age = period−cohort, only two of these effects are estimable at a time. In this paper, we introduce an alternative framework for considering effects relating to age, period and birth cohort. In particular, instead of directly modeling age in the presence of period and cohort effects, we propose a risk modeling approach to characterize age-related risk (i.e., a hybrid of multiple biological and sociological influences to evaluate phenomena associated with growing older). The properties of this approach, termed risk-period-cohort (RPC), are described in this paper and studied by simulations. We show that, except for pathological circumstances where risk is uniquely determined by age, using such risk indices obviates the problem of collinearity. We also show that the size of the chronological age effect in the risk prediction model associates with the correlation between a risk index and chronological age and that the RPC approach can satisfactorily recover cohort and period effects in most cases. We illustrate the advantages of RPC compared to traditional APC analysis on 27496 individuals from NHANES survey data (2005–2016) to study the longitudinal variability in depression screening over time. Our RPC method has broad implications for examining processes of change over time in longitudinal studies.


Introduction
Age-Period-Cohort (APC) models are valuable for understanding phenomena that account for variability over time in outcome trajectories. The purpose of APC models is primarily to decompose data trends into age, period and cohort effects. Age effects have been described as phenomena associated with growing older; period effects as general influences that vary [... 18]. Our approach was inspired by internal risk modeling techniques used for evaluating heterogeneous treatment effects in randomized trials [19-22]. In the APC context, this new framework obviates the identifiability problem by substituting a multidimensional risk index for chronological age. We refer to the method as the risk-period-cohort (RPC) approach. Implementing the RPC model involves using a multidimensional age-related risk index that combines chronological age and other risk factors (e.g., via multiple regression or a machine learning model) to describe the aging process with respect to an outcome of interest. Provided three basic assumptions are met (discussed in section 2), the risk index will not be collinear with period or cohort, due to the inherent variability in the set of underlying risk factors. In theory, any suggested model for APC analysis that evaluates age-related risk along with period and cohort effects can be used with this technique.
In brief, we are proposing an alternative qualitative framework for APC analysis; the existing approaches use a framework that carries an identification problem. The major advantages of the risk-period-cohort (RPC) approach over existing approaches are that (a) it does not have an identification problem and (b) it is better able to capture phenomena related to processes of change over the life course.
To illustrate the RPC modeling approach, in section 2 we describe modifications that extend Yang and Land's HAPC framework [4] to RPC analysis. In section 3 we describe simulation studies to evaluate (1) the relationship between the size of the chronological age effect in the risk model and the correlation between the age-related risk index and chronological age and (2) the proficiency of the risk modeling approach, given a correctly specified internal risk model, in recovering additional cohort and period effects. In section 4 we use a health data example to illustrate the advantages of RPC over traditional APC analysis for studying longitudinal variation in depression screening. We end with concluding remarks in section 5. We consider some extensions of the RPC approach in the appendix (S1 File).

The risk-period-cohort model
The proposed RPC analytic method requires three primary assumptions:
1. Age-related risk is not linearly dependent on chronological age.
2. Chronological age and other risk factors, time periods and birth cohorts necessary to evaluate these relationships are empirically represented in the data.
3. The risk model used to estimate age-related risk is correctly specified. If the risk model is misspecified (e.g. omitted confounding risk factors, risk factors on inappropriate scales, or lack of relevant interaction terms), then we may not be able to accurately estimate age-related risk.
We now describe how RPC can be implemented using repeated measures data across individuals, with modifications that extend Yang and Land's HAPC framework [4] (see Fig 1).
Let $Y_i^0$ be a continuous outcome of interest at baseline (hence the superscript) for $i = 1, \ldots, n$ subjects in our study. Assume chronological age at baseline, $\mathrm{Age}_i$, and $X_i$, a vector of $m$ baseline covariates $(X_{1i}, X_{2i}, \ldots, X_{mi})^T$, are risk factors for $Y_i^0$ that describe age-related risk for $Y_i^0$. An internal risk model with additive effects for chronological age and the set of risk factors $X_i$ is given by the following equation:

$$Y_i^0 = \beta_0 + \beta\,\mathrm{Age}_i + \sum_{l=1}^{m} \alpha_l X_{li} + \varepsilon_i^0 \qquad (1)$$

This model describes how variability in the aging process can put a subject at higher or lower risk for the outcome of interest at baseline. Risk in the case of a continuous outcome refers to predicted values of the outcome for each subject based on a fitted risk equation (e.g., via multiple regression or a machine learning algorithm). While Eq (1) can be generalized to other outcome types (e.g., binary, count and time-to-event outcome types, as described in "Extensions of the Risk Modeling Approach" in the Supporting Information, S1 File) and other age-related risk structures (e.g., complex parametric effects involving interactions, nonlinear effects, non-parametric structures, autocorrelated error terms, and/or stratification to allow for heterogeneity across pre-defined groups), in this basic representation we include a residual error term $\varepsilon_i^0$ assumed to be independent and identically distributed with zero mean and positive variance. The resulting multidimensional age-related risk index encapsulates predictive information embedded in each subject's chronological age at baseline within the context of effects associated with other relevant risk factors:

$$\mathrm{Risk}_i = \hat{Y}_i^0 = \hat{\beta}_0 + \hat{\beta}\,\mathrm{Age}_i + \sum_{l=1}^{m} \hat{\alpha}_l X_{li} \qquad (2)$$

We can thus obviate the identification problem in traditional APC analyses by substituting the age-related risk dimension defined in (2) in place of chronological age (hence the RPC model), as long as we satisfy our three basic assumptions.
Assumption 1 leads to the necessary condition for RPC that chronological age and age-related risk are not linearly dependent (perfectly collinear) or near linearly dependent (approaching collinearity as the sample size grows infinitely larger). This criterion can be restated formally as the requirement that the absolute value of the correlation between chronological age and age-related risk approaches a number strictly less than one as our sample size grows infinitely larger:

$$\left|\mathrm{Corr}(\mathrm{Age}_i, \mathrm{Risk}_i)\right| = 1 - \delta_n \qquad (3)$$

where $0 < \delta_n \le 1$ as $n \to \infty$. By having to satisfy assumption 1 we have eliminated most of the trivial cases (e.g., chronological age uniquely determines risk) where the correlation in (3) is expected to approach one as our sample size grows larger (see simulation study 1). We can then, as an example, perform longitudinal RPC analyses with the repeated measures outcome of interest by including age-related risk, period and cohort in the Hierarchical APC framework of Yang and Land [4] (see Fig 1):

Level-1 model: $$Y_{ijk} = \alpha_{jk} + \theta\,\mathrm{Risk}_{ijk} + \varepsilon_{ijk}$$

Level-2 between-subject model: $$\alpha_{jk} = \eta_0 + \tau_{0j} + \lambda_{0k} \qquad (4)$$

for $i = 1, \ldots, n_{jk}$ individuals within the $j$th time period, $j = 1, \ldots, J$, and the $k$th birth cohort, $k = 1, \ldots, K$. Here $\theta$ denotes the fixed effect of age-related risk and $\varepsilon_{ijk}$ is the Level-1 residual error term. The Level-2 model defines the group-level effect $\alpha_{jk}$, which is the intercept in the Level-1 model, as the sum of an intercept term $\eta_0$ and random effects for each period $\tau_{0j}$ and cohort $\lambda_{0k}$. The period, cohort and Level-1 residuals are each assumed to be independent and identically distributed with zero mean and positive variance.

Fig 1. Structural diagram of the HAPC model using RPC analyses (not including a quadratic risk effect on the outcome). This diagram is adapted from work by Bell and Jones (2014) [9]. Each period and cohort will include the spectrum of individuals of different risk levels. However, this is cross-classified since each period does not include all cohort groups and vice versa. https://doi.org/10.1371/journal.pone.0219399.g001
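The identification problem, and the way a risk index sidesteps it under assumption 1, can be illustrated with a small rank check. The sketch below is an illustration only and is not taken from the original analysis: the toy subjects, the extra risk factor and the weight of 5 given to it are all hypothetical values, and `matrix_rank` is a helper written for this example.

```python
from fractions import Fraction


def matrix_rank(rows):
    """Rank of a small matrix via exact Gaussian elimination on Fractions."""
    m = [[Fraction(v) for v in row] for row in rows]
    rank = 0
    for col in range(len(m[0])):
        # Find a row at or below `rank` with a nonzero entry in this column.
        pivot = next((r for r in range(rank, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(rank + 1, len(m)):
            factor = m[r][col] / m[rank][col]
            m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
    return rank


# Hypothetical toy data: (period, birth cohort, one extra risk factor).
subjects = [(2006, 1950, 0), (2006, 1960, 3), (2010, 1950, 1),
            (2010, 1970, 2), (2014, 1960, 0), (2014, 1980, 4)]

# Traditional APC design: intercept, age, period, cohort, with age = period - cohort.
apc_design = [[1, p - c, p, c] for p, c, _ in subjects]

# RPC design: age is replaced by a risk index that also depends on the extra
# risk factor (the weights 1 and 5 are arbitrary illustrative values).
rpc_design = [[1, (p - c) + 5 * x, p, c] for p, c, x in subjects]

apc_rank = matrix_rank(apc_design)
rpc_rank = matrix_rank(rpc_design)
print(apc_rank, rpc_rank)
```

With these toy values the APC design has four columns but rank 3, so one effect cannot be estimated, whereas the RPC design has full column rank 4 because the extra risk factor is not a linear function of period and cohort.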

Stability of results using the RPC method
Above, we described three assumptions of RPC. Satisfying the three assumptions is imperative for obtaining a credible solution (e.g. age-related risk heterogeneity in simulation study 2). Further, if the correlation between chronological age and age-related risk is very high, say 0.97, then serious multicollinearity exists among age-related risk, period and cohort effects, which could lead to less reliable results. We suggest that this correlation should be less than or equal to 0.85 [23] to avoid modeling a multicollinear system of age-related risk, period and cohort variables. We show that in practical scenarios similar to our simulation studies, this correlation will most likely be less than or equal to 0.85 when assumption 1 is satisfied. If, on the other hand, risk is relatively independent of chronological age (with a low absolute correlation between risk and chronological age), then the effect of chronological age is small (assuming the risk model has been internally derived). As a result, one might either want to remove age from the model entirely or continue to use risk to describe the aging process, given that phenomena related to growing older in the research study are relatively independent of chronological age.
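The correlation check suggested above can be sketched as follows. This is a minimal illustration rather than the authors' code: the function names and the toy age and risk values are hypothetical, and 0.85 is simply the threshold suggested in the text.

```python
import math


def pearson_correlation(x, y):
    """Sample Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def risk_index_is_usable(age, risk, threshold=0.85):
    """True when |corr(age, risk)| does not exceed the suggested threshold."""
    return abs(pearson_correlation(age, risk)) <= threshold


# Hypothetical toy values for six subjects.
age = [30, 40, 50, 60, 70, 80]
risk_determined_by_age = [2 * a + 1 for a in age]  # risk is a linear function of age
risk_mostly_other_factors = [5, 1, 6, 2, 7, 3]     # risk driven mainly by other factors

print(risk_index_is_usable(age, risk_determined_by_age))     # False
print(risk_index_is_usable(age, risk_mostly_other_factors))  # True
```

In the first case risk is uniquely determined by age, the pathological situation excluded by assumption 1; in the second, the correlation is low enough that the analyst might also ask whether age contributes meaningfully to the risk index at all.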
We also note that assumption 3, correct model specification, should be interpreted in accordance with previous findings on internal risk modeling [19-22]. These findings suggest that if an internal risk model performs well in comparison to existing external models based on measures of predictive accuracy, then there might be clinical utility to the internal risk model. While externally-derived risk models may reasonably be considered if they perform accurately for the sample at hand, internal risk models maximize internal risk performance, and for this reason we favor this approach.
As mentioned, many APC models in the literature that use repeated measures data across individuals treat chronological age at baseline as a fixed effect, while variables representing time period and birth cohort are random effects [4]. Accordingly, in the RPC approach we create an age-related risk index using baseline measures and use it in place of chronological age in such APC models. The proposed risk modeling approach is not specific to age: the modeler can employ the same technique to build risk indexes that include period or cohort effects instead of age. Since age, period and cohort effects differ from each other, and relate to outcomes of interest, in systematic ways, it might be reasonable in certain circumstances to consider risk models that have a different set of covariates in creating each of the three risk indices. These indices can then be used in subsequent longitudinal models as fixed, random or mixed effects.

Predicted chronological age as age-related risk
Another way to conceptualize age-related risk is as a predicted chronological age after accounting for biological and sociological factors that influence the aging process. Predicted chronological age can be quantified by modeling baseline chronological age as a function of other risk factors (e.g., using multivariable linear regression).
An advantage of using predicted chronological age is that it is calculated on a similar scale as chronological age and as a result is a widely interpretable metric. A disadvantage is that one might not want to conceptualize age-related risk in terms of predicted chronological age independent of an outcome of interest. For example, in a longitudinal study of cardiovascular-related events we might prefer to calculate a five-year age-related risk for a cardiovascular event.

Overview of simulation studies of the RPC approach
We present two Monte Carlo (MC) simulation studies using R, version 3.4.2, in the RStudio integrated development environment. In simulation study 1 we illustrate how the correlation between chronological age and age-related risk varies according to the contribution of chronological age in the internal risk model. In simulation study 2, we assume that our risk model is correctly specified (assumption 3) and recover the age effect within the context of the age-related risk index. In that simulation, we also evaluate whether we can additionally recover underlying cohort and period effects using the RPC approach. We set our sample size to n = 10,000 for each MC replication in our simulation studies so that our results were robust and generalizable for large sample behavior. See Table 1 for the details of how we generated random variables and slope parameter patterns for these simulations.

Simulation study 1
We simulated a simplified risk prediction model for baseline health state $Y_i^0$ with chronological age and two additional risk factors, socioeconomic position (SEP) and physical activity level, and a residual error term assumed to follow a standard normal distribution:

$$Y_i^0 = \beta_0 + \beta\,\mathrm{Age}_i + \alpha_1\,\mathrm{SEP}_i + \alpha_2\,\mathrm{Activity}_i + \varepsilon_i^0, \quad \varepsilon_i^0 \sim N(0, 1) \qquad (5)$$

In this simulation, the aging process in relation to subject-level variation in baseline health state can be described as a function of chronological age, SEP and physical activity level. The distributions of the variables considered in this model, shown in Table 1, were chosen to be practically plausible. We consider a study population with numerical chronological ages from 30 to 80, a normally distributed SEP variable, and an activity variable that is non-normally distributed across the study population, to represent a measure with skewness in a real population. We can replace these variables without loss of generality since we are interested in the magnitude of the slopes in (5), $(\beta, \alpha_1, \alpha_2)$, for these covariates, which in this setting is controlled via the simulation conditions independent of the type of variable.
We define the age-related risk index as the fitted value for the outcome:

$$\mathrm{Risk}_i = \hat{Y}_i^0 = \hat{\beta}_0 + \hat{\beta}\,\mathrm{Age}_i + \hat{\alpha}_1\,\mathrm{SEP}_i + \hat{\alpha}_2\,\mathrm{Activity}_i \qquad (6)$$

We varied the slopes in (5), $(\beta, \alpha_1, \alpha_2)$, in simulating data to understand how the contribution of the age effect can affect the correlation between chronological age and age-related risk. Loosely, the magnitudes of the chosen effects represent null (0.0), small (0.2), moderate (0.4 and 0.6), large (0.8) and very large (1.0) effects on a standardized coefficient scale.
We fit the corresponding risk prediction model for (5) using linear regression. Then, we calculated the age-related risk index using (6). We calculated the size of the chronological age effect as the proportionate chronological age effect in the risk prediction model using the estimated parameters in (6):

$$\frac{\hat{\beta}}{\hat{\beta} + \hat{\alpha}_1 + \hat{\alpha}_2} \qquad (7)$$

The overwhelming majority of the slope parameter patterns yielded useful age-related risk indices, containing information regarding the extent to which chronological age determines age-related risk (within the context of other effects; see Fig 2). The simulation showed that the correlation between chronological age and risk generally rises or falls with the contribution of chronological age in the risk prediction model. Table 2 includes all cases where the correlation between chronological age and age-related risk was above the 0.85 threshold (29 cases). We found that the correlation between chronological age and risk was above the 0.85 threshold whenever the proportionate chronological age effect was greater than 0.60 (24/29 cases); the proportionate chronological age effect was still quite large, between 0.55 and 0.59, for the other five cases (see Table 2). When chronological age correlated with the outcome only weakly, the age-related risk index was close to linear independence from chronological age (see Fig 2 when $\beta = 0$). However, the sole outlying point in Fig 2 for $\beta = 0$ (where, due to noise, the proportionate chronological age effect is incongruent with the near-zero correlation between chronological age and age-related risk) is the trivial case where all slope parameters are zero, violating assumption 1.

Table 2. Simulated slope parameter patterns in which the average correlation between chronological age and age-related risk was greater than or equal to 0.85.
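A single replication of a simulation in the spirit of study 1 can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the authors' R code: the predictors are standardized to unit variance so that the slopes sit on a standardized scale, physical activity is drawn from a shifted exponential distribution to make it skewed, and the proportionate chronological age effect is taken to be the estimated age slope divided by the sum of the three estimated slopes. The helper names are hypothetical.

```python
import math
import random


def ols_slopes(columns, y):
    """Least-squares slopes (intercept handled by centering), via normal equations."""
    n, p = len(y), len(columns)
    y_mean = sum(y) / n
    yc = [v - y_mean for v in y]
    centered = []
    for col in columns:
        col_mean = sum(col) / n
        centered.append([v - col_mean for v in col])
    # Augmented system [X'X | X'y] built from the centered data.
    a = [[sum(u * v for u, v in zip(ci, cj)) for cj in centered]
         + [sum(u * v for u, v in zip(ci, yc))] for ci in centered]
    for k in range(p):  # Gaussian elimination with partial pivoting
        pivot = max(range(k, p), key=lambda r: abs(a[r][k]))
        a[k], a[pivot] = a[pivot], a[k]
        for r in range(k + 1, p):
            f = a[r][k] / a[k][k]
            a[r] = [x - f * z for x, z in zip(a[r], a[k])]
    slopes = [0.0] * p
    for k in reversed(range(p)):  # back substitution
        tail = sum(a[k][j] * slopes[j] for j in range(k + 1, p))
        slopes[k] = (a[k][p] - tail) / a[k][k]
    return slopes


def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def simulate_once(beta, alpha1, alpha2, n=10_000, seed=0):
    """Return (corr(age, risk), proportionate age effect) for one replication."""
    rng = random.Random(seed)
    age_sd = 50 / math.sqrt(12)  # standard deviation of Uniform(30, 80)
    age = [(rng.uniform(30, 80) - 55) / age_sd for _ in range(n)]  # standardized (assumed)
    sep = [rng.gauss(0, 1) for _ in range(n)]
    activity = [rng.expovariate(1.0) - 1.0 for _ in range(n)]  # skewed, mean 0, variance 1
    health = [beta * a + alpha1 * s + alpha2 * c + rng.gauss(0, 1)
              for a, s, c in zip(age, sep, activity)]
    b, a1, a2 = ols_slopes([age, sep, activity], health)
    # The intercept is omitted from the index: it does not change the correlation.
    risk = [b * a + a1 * s + a2 * c for a, s, c in zip(age, sep, activity)]
    return pearson(age, risk), b / (b + a1 + a2)


corr_hi, share_hi = simulate_once(1.0, 0.2, 0.2)  # age dominates the risk model
corr_lo, share_lo = simulate_once(0.2, 0.8, 0.8)  # other risk factors dominate
print(round(corr_hi, 2), round(share_hi, 2), round(corr_lo, 2), round(share_lo, 2))
```

Under these assumptions, the pattern in which age dominates should land above the 0.85 threshold, while the pattern dominated by the other two risk factors should fall well below it, mirroring the qualitative relationship described above.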

Simulation study 2
For simplicity, we assumed that current health state $Y_i$ can be simulated in the same manner as baseline health state from (5), with the exception that current health state also depends on a cohort variable:

$$Y_i = \beta_0 + \beta\,\mathrm{Age}_i + \alpha_1\,\mathrm{SEP}_i + \alpha_2\,\mathrm{Activity}_i + \Gamma\,\mathrm{Cohort}_i + \varepsilon_i \qquad (8)$$

We fit the model for (8) using linear regression. We assumed no period effect in this model in order to study the RPC model's capacity for recovering null effects (in addition to values of the parameter $\Gamma$). We also used the simulated data based on each slope parameter pattern to fit respective RPC models, incorporating the age-related risk index from Eq (6):

$$Y_i = \beta_0^{*} + \theta\,\mathrm{Risk}_i + \Gamma^{*}\,\mathrm{Cohort}_i + \Lambda^{*}\,\mathrm{Period}_i + \varepsilon_i^{*} \qquad (9)$$

where $\theta$ is the slope for the age-related risk index and $\Gamma^{*}$ and $\Lambda^{*}$ are the cohort and period effects estimated by the RPC model. Assuming we have recovered the age effect within the age-related risk index, if we could satisfactorily recover the cohort and period effects using RPC, we would expect $\Gamma - \Gamma^{*} = 0$ and $\Lambda^{*} = 0$. We flagged cases where the average absolute value of either of these two estimated values over the MC replications of each slope parameter pattern was greater than a trivially small effect size (0.01). In these cases, we might fail to precisely recover the simulated cohort effect and falsely identify some possibility of a nonexistent period effect.
We found that the RPC approach accurately estimates both the simulated cohort and period effects, by our criteria, for 1266 of the 1296 (98%) slope parameter patterns we tested. The 30 cases where period and cohort effects were not precisely estimated in the RPC model (see Table 3) consisted of all the trivial cases where chronological age was the only significant risk factor in the internal risk model ($\beta > 0$, $\alpha_1 = 0$, $\alpha_2 = 0$). The bias grew with the size of the chronological age parameter $\beta$. Therefore, by imposing the assumption of age-related risk heterogeneity (assumption 1), we are able to recover all additional cohort and period effects using RPC in our simulated nontrivial test cases.
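The recovery of a cohort effect and a null period effect can be sketched with one replication of a simulation in the spirit of study 2. This is an illustration under stated assumptions and not the authors' R code: period is drawn from six survey waves coded as years since 2005, cohort is defined as period minus age, age is standardized inside the risk model, and the tolerance used in the usage note is looser than the 0.01 criterion in the text because only a single replication is run. All function names are hypothetical.

```python
import math
import random


def ols_slopes(columns, y):
    """Least-squares slopes (intercept handled by centering), via normal equations."""
    n, p = len(y), len(columns)
    y_mean = sum(y) / n
    yc = [v - y_mean for v in y]
    centered = []
    for col in columns:
        col_mean = sum(col) / n
        centered.append([v - col_mean for v in col])
    a = [[sum(u * v for u, v in zip(ci, cj)) for cj in centered]
         + [sum(u * v for u, v in zip(ci, yc))] for ci in centered]
    for k in range(p):  # Gaussian elimination with partial pivoting
        pivot = max(range(k, p), key=lambda r: abs(a[r][k]))
        a[k], a[pivot] = a[pivot], a[k]
        for r in range(k + 1, p):
            f = a[r][k] / a[k][k]
            a[r] = [x - f * z for x, z in zip(a[r], a[k])]
    slopes = [0.0] * p
    for k in reversed(range(p)):  # back substitution
        tail = sum(a[k][j] * slopes[j] for j in range(k + 1, p))
        slopes[k] = (a[k][p] - tail) / a[k][k]
    return slopes


def simulate_rpc(beta, alpha1, alpha2, gamma, n=5000, seed=1):
    """Return the RPC estimates (cohort effect, period effect) for one replication."""
    rng = random.Random(seed)
    period = [rng.choice([1, 3, 5, 7, 9, 11]) for _ in range(n)]  # years since 2005 (assumed)
    age_years = [rng.uniform(30, 80) for _ in range(n)]
    cohort = [p - a for p, a in zip(period, age_years)]           # cohort = period - age
    age_sd = 50 / math.sqrt(12)
    age = [(a - 55) / age_sd for a in age_years]                  # standardized age (assumed)
    sep = [rng.gauss(0, 1) for _ in range(n)]
    activity = [rng.expovariate(1.0) - 1.0 for _ in range(n)]     # skewed risk factor
    signal = [beta * a + alpha1 * s + alpha2 * c for a, s, c in zip(age, sep, activity)]
    baseline = [v + rng.gauss(0, 1) for v in signal]              # baseline health state
    current = [v + gamma * c + rng.gauss(0, 1)                    # cohort effect, no period effect
               for v, c in zip(signal, cohort)]
    # Step 1: internal risk model fitted to the baseline outcome.
    b, a1, a2 = ols_slopes([age, sep, activity], baseline)
    risk = [b * a + a1 * s + a2 * c for a, s, c in zip(age, sep, activity)]
    # Step 2: RPC model with risk in place of chronological age.
    _, gamma_star, lambda_star = ols_slopes([risk, cohort, period], current)
    return gamma_star, lambda_star


gamma_star, lambda_star = simulate_rpc(beta=0.4, alpha1=0.6, alpha2=0.4, gamma=0.6)
print(round(gamma_star, 3), round(lambda_star, 3))
```

With nonzero slopes for the two additional risk factors, the estimated cohort effect should be close to the simulated value of 0.6 and the estimated period effect close to zero. Setting `alpha1` and `alpha2` to zero makes the risk index almost an exact function of age, so the design becomes nearly collinear with period and cohort, which is the trivial case described above.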

Real-world application of the RPC approach to model changes in depression screening outcomes from 2005-2016
We performed both traditional APC and RPC analysis on an illustrative example to demonstrate the potential consequences of not being able to estimate all levels of age, period and cohort effects in real settings.
The National Health and Nutrition Examination Survey (NHANES) is a program of the National Center for Health Statistics that began in 1960. The objective of the NHANES is to assess the health and nutritional status of individuals in the United States. The NHANES is a cross-sectional collection of surveys and other health examination data for a nationally representative sample of the resident, civilian, non-institutionalized U.S. population with approximately 5,000 individuals sampled each year [24,25].
We created an age-related risk index using predicted chronological age. The internal risk model for chronological age included sex, number of comorbidities (ranging from 0 to 11), race/ethnicity (Hispanic, Non-Hispanic White, Non-Hispanic Black and Other Race), ratio of family income to poverty level, and social support (yes = married or living with partner, no = widowed, divorced, separated or never married) as predictor variables. We performed listwise deletion of missing values on the covariates, leaving 27496 complete cases. In our risk model, all variables were statistically significant (p < 0.05) and the model R-squared and adjusted R-squared were both 0.23. The correlation between age-related risk and chronological age was 0.48, thus fitting well within our conditions for this correlation as described in simulation study 1.
In a traditional APC approach, we regressed the PHQ-9 total score on chronological age, time period and birth cohort. In the RPC approach we regressed PHQ-9 total score on age-related risk, time period and birth cohort. To approximate a birth cohort for each individual, we used the even year in each two-year range (2006, 2008, 2010, 2012, 2014, 2016) as the time period. We categorized both time period and birth cohort. We then grouped together birth cohorts every five years for our regression modeling. Due to the skewness of the PHQ-9 in our study population, we applied a log(x + 1) transformation to this outcome.

The APC approach was only able to estimate time period and birth cohort effects because we categorized birth cohort by five-year ranges and removed levels that would make it linearly dependent with chronological age and time period; nonetheless, there was evidence of variance inflation (i.e., larger standard errors) for period and cohort effects within the APC model relative to those estimated in the RPC model (see Table 4). The RPC model additionally resulted in improved model fit statistics ($R^2$ and adjusted $R^2$) relative to the APC model. Age-related risk was significantly associated with depression screening in the RPC model, while chronological age was not significantly associated with depression screening in the traditional APC model. Point estimates for period effects in the two models were relatively similar, although the RPC model showed a significantly positive association of the 2013-2014 and 2015-2016 time periods (reference level 2005-2006) with PHQ-9 total score while the traditional APC model did not. The birth cohort point estimates were in a similar direction between the two models. However, the traditional APC model had lower point estimates (for the most part) and larger standard errors and did not have a significant birth cohort effect.
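The data preparation steps described for this application (approximating birth cohort from the survey period and age, grouping cohorts into five-year ranges, and transforming the skewed PHQ-9 score) can be sketched as below. This is a hypothetical illustration, not the authors' code: the function names are invented for the example, and the five-year bins are assumed to be aligned so that 1946-1950 is one bin, as suggested by the cohort labels reported in the results.

```python
import math


def birth_cohort_bin(period_year, age, width=5, anchor=1946):
    """Return (start, end) of the birth-cohort bin containing period_year - age.

    `anchor` is the assumed first year of one bin (1946-1950 here).
    """
    birth_year = period_year - age
    start = anchor + ((birth_year - anchor) // width) * width
    return start, start + width - 1


def transform_phq9(total_score):
    """log(score + 1) transform for the right-skewed PHQ-9 total score."""
    return math.log1p(total_score)


# A respondent aged 58 in the 2005-2006 wave (period coded as the even year 2006).
print(birth_cohort_bin(2006, 58))     # (1946, 1950)
print(birth_cohort_bin(2016, 40))     # (1976, 1980)
print(round(transform_phq9(27), 3))   # 3.332
```

Floor division keeps the bins aligned for birth years earlier than the anchor as well, so a single anchor year is enough to define the whole grid of cohort groups.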
Of note in the RPC model, there was a strong birth cohort trend beginning in 1946-1950 associated with an increase in depression scores. This positive association grows in magnitude across younger birth cohorts.

Abbreviations: RPC = risk-period-cohort; APC = age-period-cohort; SE = standard error; p = p-value. In an alternative specification, we categorized every level of birth cohort and thus had to leave out a level in order to resolve the identifiability problem. The substantive results and conclusions for this alternative specification regarding age, period and cohort effects were, however, similar to those of the traditional APC approach presented in Table 4. https://doi.org/10.1371/journal.pone.0219399.t004

In Fig 3 we graphically depict the relationship between PHQ-9 total score and birth cohort grouped by quintiles of age-related risk. The figure shows, with considerable variability across age-related risk groups, an increase in the average PHQ-9 total score in moving across the x axis to younger birth cohorts. The increasing slope is most prominent for individuals in the fifth quintile (highest age-related risk) before data become more sparse in later birth cohorts.

We re-ran the APC model including the explanatory variables used in the risk model as covariates. The model fit improved substantially ($R^2$ and adjusted $R^2$ both equal to 0.12). However, the standard errors in general were still larger than in the RPC model. In this model one still needs to force a constraint in order to estimate the age, period and cohort effects due to linear dependency. The major difference in the results from this APC analysis (incorporating covariates) in comparison to the traditional APC model reported in Table 4 was that a significant (p-value ranging from 0.01 to 0.05) birth cohort effect was detected between 1951 and 1970. This effect was then no longer significant for the post-1970 birth cohorts.
For further confirmation by comparison to a mixed effects approach, we re-ran both the RPC and APC models using a multilevel framework. In the RPC multilevel model we included a fixed effect for risk and in the APC multilevel model we included fixed effects for chronological age and quadratic chronological age. Both models included random effects for period and cohort. The mixed effects results paralleled the results for the fixed effects regression models: the RPC model detected a significant positive age-related risk effect and a positive cohort effect from 1951 onward, while the APC model did not detect an age or cohort effect.

Future work for evaluating RPC effects in a longitudinal study of cardiovascular-related events
Our real data application with the NHANES survey data used different waves of cross-sectional data. Thus, we were not able to evaluate age, period and cohort effects on the same individuals across time. In a future study we will evaluate age-related risk, period and cohort effects for cardiovascular-related events using repeated measures data across individuals. A possible hypothesis would be that meaningful RPC effects would be observed because older birth cohorts were at higher risk for cardiovascular outcomes, including mortality, in the 1980s compared to today, owing to the higher prevalence of smoking at that time and the advent of a variety of effective clinical and medical interventions since the 1980s (e.g. statins) [27].

Conclusions
Age-period-cohort (APC) models have been proposed as a longitudinal modeling solution that can decompose data trends for time-varying events into age, period and cohort effects. However, due to the identification problem, researchers have been hamstrung by challenges in recovering underlying APC effects arising from the inherent collinearity of these factors. Prior longitudinal studies suffer from a potentially critical limitation in understanding the full range of longitudinal variation when not properly examining all age, period and cohort effects [9,11-15,27-29]. Our risk-period-cohort (RPC) method provides a straightforward way to understand such longitudinal variation by obviating the identification problem.
Our RPC approach essentially transforms chronological age into a multidimensional hybrid measure to describe age-related risk. Further, the RPC approach allows for contextualizing APC effects in relation to clinical outcomes and other risk factors while addressing different sources of subject heterogeneity. As such, we believe that the RPC framework can be usefully applied for many epidemiological studies of human development and gerontology involving large observational data sources, in that it allows for examining sources of variation over broad time durations.
Our first simulation study revealed that in most of our tested cases, the age-related risk index yielded acceptable levels of multicollinearity among risk, period and cohort effects. The risk index also informs the researcher about the size of the chronological age effect in contributing to age-related risk. In using the RPC approach, as revealed in our simulation study 1, it is important to check (1) the correlation between the age-related risk index and chronological age and (2) the size of the chronological age effect (i.e. proportionate chronological age effect) in the risk prediction model. In practice, model adjustments in line with theory and prior empirical evidence might be considered to optimize these two indices before performing subsequent longitudinal RPC analyses. Although we anticipate that the values of these two indices should generally be below the acceptable thresholds described above, investigators might employ remedial measures when the indices are above those thresholds (e.g., inclusion of additional clinically relevant risk factors for age-related risk in the risk prediction model when such factors are available).
Existing methods that attempt to resolve the identification problem may fail to recover underlying APC effects due to variance inflation resulting from multicollinearity of these factors. Our second simulation study showed in our test cases, given that the internal risk model is correctly specified and recovers the age effect within the age-related risk index, that the RPC approach was able to additionally recover both the simulated cohort and period effects in all nontrivial cases. These simulation studies showed the necessity of assumption 1, age-related risk heterogeneity. The simulation results showed that the RPC model should be useful for many applied problems. However, there are other scenarios encountered in practice not covered in our simulation studies that will need to be studied further.
Our real data application of the RPC approach to the NHANES survey data illustrated the advantages of being able to recover all levels of APC effects without imposing constraints to resolve the identifiability problem. Smaller standard errors were observed in the RPC model as compared to a traditional APC model under different specifications. Of primary concern was that the traditional APC model may have missed substantively important longitudinal variation in depression screening between 2005 and 2016. The RPC model found a significant age effect and a cohort effect beginning around 1946-1950 and continuing across subsequent birth cohorts that was absent from the traditional APC models.
In this paper we have described the RPC approach for obviating the identification problem in APC models, with broad implications for accounting for time-varying events that impact longitudinal studies. In future studies we will seek to apply this approach using longitudinal data from national cohorts and from electronic health records. Such future work is needed to examine the empirical value of the RPC approach using repeated measures data on individuals.