SAS Macros for Calculation of Population Attributable Fraction in a Cohort Study Design

The population attributable fraction (PAF) is a useful measure for quantifying the impact of exposure to certain risk factors on a particular outcome at the population level. Recently, new model-based methods for the estimation of PAF and its conﬁdence interval for diﬀerent types of outcomes in a cohort study design have been proposed. In this paper, we introduce SAS macros implementing these methods and illustrate their application with a data example on the impact of diﬀerent risk factors on type 2 diabetes incidence.


Introduction
Quantification of the impact of exposure to modifiable risk factors on a particular outcome, such as death or disease occurrence, at the population level is a fundamental public health issue. Population attributable fraction (PAF) assesses the proportion of the outcome attributable to exposure to such modifiable risk factors in a given population by estimating the proportion of outcome that would not have occurred if all individuals had belonged to the low-risk reference category of those factors (for example, what proportion of deaths would not have occurred if nobody had started smoking). Both relative risk (RR) and the prevalence death and disease occurrence are assumed. In Section 2, we give a definition of PAF for a time interval (0, t]. In Section 3, we propose an estimate of PAF for total mortality and disease incidence and derive its asymptotic variance, both in the total population and in subpopulations. In Section 4, we explain the SAS macros for the estimation of PAF for total mortality and disease incidence in the presence of potential effect modification. In Section 5, we illustrate the application of these macros. Finally in Section 6, we discuss the strengths and weaknesses related to our program and its application.

Concept of PAF in cohort study design
Consider the occurrence of an outcome A in a population of n individuals with risk factor values X i = (x i1 , . . . , x im ) , i = 1, . . . , n. In a cohort study design, PAF is defined to be the proportion of the outcome occurrence that could be avoided during a certain follow-up time (T ), which is determined as the time from baseline (t = 0) to the time of the event of interest or censoring (whichever comes first), if it was possible to change some risk factor values to their chosen target values, X i = (x i1 , . . . , x im ) → X * i = (x * i1 , . . . , x * im ) (Laaksonen et al. 2010b). In this notation, X i is the vector of all risk factors of the ith individual considered relevant (modifiable, non-modifiable, and confounding factors), and thus only the modifiable risk factors whose effect we wish to measure will have a different value in X * i while the rest of the factors retain their values. The PAF is then where P{A i |X i } is the probability of the occurrence of outcome (A) for the ith individual with the risk factors X i .
In this study, we are interested in calculating PAF both for occurrence of a terminal outcome, such as death, and for order of occurrence of two terminal outcomes, such as occurrence of a chronic disease before death, during a time interval (0, t]. If the outcome of interest is death, PAF is the proportion of mortality that could hypothetically be avoided during a time interval (0, t] if its risk factors were modified (Laaksonen et al. 2010b). Let T M denote the time of death. Then the proportion of excess mortality up to time t due to certain modifiable risk factors in X i is given by where P{T M i ≤ t|X i } is the probability of death up to time t, given the risk factor values X i . If, however, the outcome of interest is incidence of disease, PAF is the proportion of disease cases that could hypothetically be avoided during a time interval (0, t] if its risk factors were modified. In this case, mortality before contracting the disease of interest causes selection in the population during follow-up (Laaksonen et al. 2010a). If the risk factors that are related to the incidence of the disease of interest are also related to mortality, the modification of these risk factors is likely to affect both the risk of the disease and the risk of death. Therefore, censoring due to death needs to be taken into account in the definition of PAF for disease incidence. The importance of considering censoring due to death in the estimation of PAF for disease incidence has been demonstrated elsewhere (Laaksonen et al. 2010a). Let T D denote the time of the occurrence of the disease. Then the proportion of excess disease incidence up to time t due to certain modifiable risk factors in X i is given by where P{T D i ≤ min(T M , t)|X i } is the probability of the disease incidence up to time t, given the risk factor values X i .
We thus need two different definitions of PAF, PAF(T M ≤ t) and PAF(T D ≤ min(T M , t)), depending on the outcome of interest. Furthermore, to analyze the impact of some potential effect modifying factor on the relationship between the risk factor and the outcome of interest at the population level, we can calculate PAFs in the subpopulations defined by different categories of the effect modifying factor. By calculating the differences between these PAF estimates and their confidence intervals we can further assess the statistical significance of the effect modification.
3. Estimation of PAF in cohort study design 3.1

. General model assumptions
The following assumptions in the calculation of PAF for total mortality or for disease incidence in the cohort study design are made in this study. Proportional hazards models are applied. The hazard of death is h M (t) and the hazard of disease incidence h D (t). The corresponding cumulative hazard functions are then H M (t) = For each individual, the hazard functions are assumed to depend on all relevant risk factors (X) for both mortality and disease incidence: h M (t) := h M (t; X) and h D (t) := h D (t; X). The time of death T M and the time of the occurrence of disease T D are assumed to be conditionally independent given X. The hazard function for disease-free survival at time t, min(T M , T D ) > t, is assumed to be h M (t; X) + h D (t; X). Then, the probability that the first event is disease is There may still be right-censoring by T C which is assumed to be conditionally independent of T M and T D given X. If the outcome of interest is death, we then observe T C = min{T C , T M } in case of right-censoring or T M = min{T C , T M } in case of death. If the outcome of interest is incidence of disease, we observe It is important to note that the definition of PAF does not depend on T C .

Piecewise constant hazards model
In the calculation of PAF, the times T D and T M are assumed to follow a proportional hazards model with piecewise constant baseline hazard functions, given X. A parametric piecewise constant hazards model is chosen due to its flexibility in accommodating to the shape of the underlying survival curve and ease of computation (Laaksonen et al. 2010a,b). In a parametric piecewise constant hazards model, the follow-up time is partitioned into J intervals (0 = a 0 , a 1 ], . . ., (a j−1 , a j ], . . ., (a J−1 , a J ], where a j−1 < a j for all j and the hazard for the ith individual is allowed to depend on time by letting the baseline hazard λ 0j change at times a j (Friedman 1982). A log-linear function between the risk factors and the hazard function is thus assumed. The effect of age can be taken into account by dividing the range of individual dates of birth , and then further stratifying the baseline hazard by them (λ 0jb i ), where b i is the birth cohort for the ith individual (Korn, Graubard, and Midthune 1997).
Let us thus denote the hazard of death at time t for the ith individual given the birth cohort b i and the risk factors and the hazard of disease incidence as and In this notation, α M jb i = log λ M ojb i is the logarithm of the baseline hazard of death (λ M 0jb i ) and α D jb i = log λ D ojb i the logarithm of the baseline hazard of disease incidence (λ D 0jb i ). Similarly, β M and β D are the vectors of regression coefficients for death and disease incidence, respectively, for the covariates X i , which can be either categorical, continuous or their interactions. Furthermore, Z ij is the vector with length J×B + m including J×B indicators of time interval and birth cohort and the covariates corresponding to the regression coefficients γ M = (α M 11 , . . . , α M JB , β M 1 , . . . , β M m ) and γ D = (α D 11 , . . . , α D JB , β D 1 , . . . , β D m ) . The λ * M ij and λ * D ij follow similarly by replacing X i by X * i in (3) and (4).

PAF for total mortality
The estimation of PAF for total mortality is described in detail elsewhere (Laaksonen et al. 2010b) and is only briefly summarized here.
The probability of death during (0, a j ] for the ith individual given the birth cohort b i and the risk factors X i can be calculated as is the survival up to time a j and j ∈ {1, . . . , J}. The S * M iJ follows similarly by replacing X i by X * i in this formula. Thus, according to Equation 1 the PAF for total mortality during (0, a J ], PAF M (0,a J ] , can be calculated as The PAF for total mortality at any chosen interval (a j−1 , a j ], PAF M (a j−1 ,a j ] , can be calculated similarly by using probabilities In order to estimate the PAF for total mortality, written briefly PAF in here, we first need to estimate the model parameters γ M . Estimation of these parameters is based on data of the individual follow-up times until death or censoring, whichever comes first: T i = min(T M i , T C i ). In this study, maximum likelihood estimation is used and the SAS procedure LIFEREG is used to compute these maximum likelihood estimatesγ M and their estimated covariance matrix Σ M . The point estimate of PAF, PAF, is then obtained by replacing the unknown parameter values γ M in Equation 5 by their maximum likelihood estimatesγ M . A symmetrizing monotone strictly increasing complementary logarithmic transformation of PAF, g(PAF) = log (1 − PAF), is used to obtain an approximate 95% confidence interval of PAF. According to the delta method ). The confidence interval of the transformation of PAF is then obtained by where the limiting variance of g(PAF) can be consistently estimated bŷ The confidence interval is finally transformed back to the original scale using the inverse of the complementary logarithmic transformation

PAF for disease incidence
The estimation of PAF for disease incidence is described in detail elsewhere (Laaksonen et al. 2010a) and is only briefly summarized here.
The probability of disease occurrence, when also the time of death is taken into account, for the ith individual given the birth cohort b i and the risk factors X i , can be calculated as is the disease-free survival up to time a j . Thus, according to Equation 2 the PAF for the incidence of disease during (0, a J ], PAF D (0,a J ] , can be calculated as In order to estimate the PAF for disease incidence, written briefly PAF in here, we first need to estimate the model parameters γ D and γ M . Estimation of these parameters is based on data of the individual follow-up times until the occurrence of the disease, death or censoring, whichever comes first: Similarly to the Section 3.3, the maximum likelihood estimation method is used. The maximum likelihood estimatesγ D andγ M are asymptotically independent (the Fisher information matrix is block-diagonal). The asymptotic distribution of The point estimate of PAF, PAF, is obtained by replacing the unknown parameter values γ D and γ M in Equation 9 by their maximum likelihood estimatesγ D andγ M . The approximate 95% confidence interval of the transformation g(PAF) = log (1 − PAF) is then obtained as in Equation 6, where the limiting variance of g(PAF) can be consistently estimated bŷ The confidence interval is finally transformed back to the original scale according to Equation 8.

PAF in the presence of potential effect modification
In the calculation of PAF, we may want to consider the potential effect modification, i.e., whether the relationship between the risk factor and the outcome of interest, and thus potentially also PAF, varies according to the values of a potential effect modifying factor. Here, the potential effect modifying factor is assumed to be categorical. To analyze the impact of the potential effect modifying factor, an interaction term between the risk factor and the potential effect modifying factor is included in the model, giving separate estimates for the risk factor in the different categories of the potential effect modifying factor. Separate PAF estimates are then calculated in the subpopulations defined by the categories of the effect modifying factor. The statistical significance of interaction can be determined by calculating the 95% confidence intervals for the differences between these PAF estimates. If the confidence interval does not cover zero the difference between the PAF estimates is considered to be statistically significant. For example, in case of an effect modifying factor with two categories, we calculate two separate PAF estimates PAF 1 and PAF 2 and estimate the difference PAF 1 − PAF 2 and its 95% confidence interval where PAF is used to denote either PAF for total mortality or PAF for the incidence of disease. The variance of PAF difference is obtained using the delta method, where the limiting variance of PAF 1 − PAF 2 can be consistently estimated bŷ Similarly, we can also use main effect models and calculate and compare PAF estimates in subpopulations defined by some other factor of interest (such as sex) included in the model concerned.

SAS modules
The estimation procedure of PAF for total mortality or disease incidence, in the presence of confounding factors and effect modification, is organized as a sequence of SAS macros. This section outlines the functionality of these macros so that an advanced user can make use of them. The use of these macros requires SAS 9.2 procedures LIFEREG, LOGISTIC, TRANSPOSE, SQL, and IML (SAS Institute Inc. 2010).
To perform PAF analysis, a data preparation procedure is required to create input data files for the main SAS macro, PAF_M for total mortality or PAF_D for disease incidence (see Figure 1). The main macros, PAF_M and PAF_D, are composed of the following steps: 1. The main macro calls the macro EST_MATRIX to prepare the design matrices (Z and Z * in Equation 3  2. The main macro calls the macro EST_PAF_M for total mortality or EST_PAF_D for disease incidence to calculate the PAF estimates, their standard errors and 95% confidence intervals (the IML procedure) using the formulas provided in Sections 3.3 and 3.4.
3. The main macro prints out the relative risks and PAF estimates together with their 95% confidence intervals for the risk factors of interest (a more comprehensive output from the LIFEREG procedure is optional).
When subpopulation analyses with respect to a certain factor of interest are made, the design matrices (Z, Z * ) created at Step 1 are divided into separate design matrices (Z 1 , Z * 1 , . . ., Z K , Z * K ) according to the K categories of this factor. Then, the macros at Step 2 are called K times, and the K(K − 1)/2 differences between the subpopulation-specific PAF estimates are analyzed by calling yet another macro (EST_PAF_DIFF_M for total mortality and EST_PAF_DIFF_D for disease incidence) in which the formulas given in Section 3.5 are implemented. A more detailed description of the functioning of all these macros is given in Section 4.2 after the description of the data preparation in Section 4.1.

Preparation of data for PAF analysis
In the original data matrix, the rows usually correspond to individuals. On the other hand, the columns correspond to individual attributes: identification number (ID), birth year (BYEAR, birth cohorts (B_COHORT), and risk factors of interest. The categorical risk factors include: gender (SEX), blood pressure status (BP), smoking status (SMOKE), age group (AGEGRP) and body mass index (BMI_2, indicating whether BMI is less than (= 1) or greater than or equal to (= 2) 25kg/m 2 ). The risk factors can also be continuous; for example, age (AGE) and body mass index (BMI). Refer to the SAS data set example. If the outcome of interest is death, a binary variable (0/1) indicating whether the person died during the follow-up (DEATH) and the follow-up time to the occurrence of death or censoring (DEATH_FT) should be included in the data matrix (see SAS data set example). If the outcome of interest is disease, also a binary variable (0/1) indicating whether the disease occurred during the follow-up (DIAB) and the follow-up time to the occurrence of the disease or censoring (DIAB_FT) should be included in the data matrix (see SAS data set example). When the order of the occurrence of the disease and death is followed for a certain time, for each individual we observe one of the four possible combinations demonstrated, theoretically, in Table 1 and, in practice, in Table 2 which shows a sample of four individuals from the SAS data set example.
For the PAF analysis, the original data matrix must be duplicated according to the number of follow-up time intervals used in the piecewise constant hazards model, to create input data Table 1: Four different possible types of observed events and notation related to them.  matrix in which one row corresponds to one follow-up time interval. The total follow-up time must be divided into these intervals (F_TIME). The input data matrix must also contain a categorical variable which indicates the cumulative follow-up time by the beginning of each follow-up time interval (F_PERIOD). If the baseline hazard in the piecewise constant hazards model is stratified by the birth cohort, a categorical variable, birth cohort indicator (B_COHORT), should be included in the input data matrix. Note that it is the user's responsibility to ensure that the choice of the follow-up time intervals and birth cohorts leads to sufficient cases in all strata of the baseline hazard variables (no zero cells) for reliable estimation of PAF.
If the outcome of interest in the PAF analysis is death, the follow-up is continued until the occurrence of death or censoring, and the follow-up time until the first event must be divided into the chosen follow-up time intervals. Thus, if the length of follow-up time intervals is chosen to be 5 years and the length of birth cohorts 20 years, the original data matrix in Table 2 needs to be modified to appear as the data matrix in Table 3 for the estimation of PAF for total mortality.
If the outcome of interest in the PAF analysis is disease incidence, the follow-up is continued until the occurrence of the disease, death or censoring, and if the first event is disease, then the follow-up time until death or censoring (DEATH_FT) needs to be set equal to the time of the occurrence of disease (DIAB_FT) and the indicator of death (DEATH) to zero (if it originally was one indicating that death would have occurred later during the follow-up). Two separate input data files for disease and death, in which the follow-up time intervals and birth cohorts are of the same length, must be created. The original data matrix in Table 2 thus needs to be modified to appear as the data matrices for death in Table 4 and for disease in Table 5 (where the length of follow-up time intervals is chosen to be 5 years and the length of birth cohorts 20 years) for the estimation of PAF for disease incidence.   OUTDATA = SAS data set specifies output data set.
If PAF for total mortality is to be estimated, the outcome of interest is death and only censoring variable (CENSORVARIABLE_1), censoring value (CENSORVALUE_1) and follow-up time variable (FTVARIABLE_1) related to death need to be given in the GEN_DATA macro call to prepare the required input data set. In the preparation of this data set, a new follow-up period variable (F_PERIOD) indicating the cumulative follow-up time by the beginning of each followup time interval is created, the total follow-up time until the occurrence of death or censoring is divided into different follow-up time intervals (F_TIME), the information not needed (related to disease incidence or total follow-up time) is dropped, and the information that remains the same from one interval to another (birth year (BYEAR) and cohort (B_COHORT), and the risk factors of interest measured at baseline) are dropped (see SAS data set example_death in Table 3).
If PAF for disease incidence is to be estimated, the outcome of interest is disease, but also censoring due to death is taken into account, and thus separate input data sets for disease and death are prepared for the PAF analysis. When the input data set for disease is prepared (see SAS data set example_disease in Table 5) only censoring variable (CENSORVARIABLE_1), censoring value (CENSORVALUE_1) and follow-up time variable (FTVARIABLE_1) related to disease need to be given in the GEN_DATA macro call. When the input data set for death is prepared (see SAS data set example_death_2 in Table 4) both censoring variable (CENSORVARIABLE_1), censoring value (CENSORVALUE_1) and follow-up time variable (FTVARIABLE_1) related to death and censoring variable (CENSORVARIABLE_2), censoring value (CENSORVALUE_2) and follow-up time variable (FTVARIABLE_2) related to disease need to be given in the GEN_DATA macro call. If the first event is disease (CENSORVALUE_2 = 1), then the follow-up time until death or censoring is set equal to the time of the occurrence of disease and the indicator of death (DEATH) is set to zero .
A more detailed description of the function of the GEN_DATA macro is given in the SAS program gen_data.sas.

PAF analysis
When the input data matrices have been prepared as described in previous Section 4.1, a SAS macro, PAF_M, for the estimation of PAF for total mortality:  Tables 3 and 4).

CENSORVARIABLE_M = variable
indicates whether censoring before death occurred. Must be a binary variable (0/1) (see variable DEATH in data sets example_death and example_death_2 in Tables 3 and 4 in Section 4.1). CENSORVALUE_M = 0 or 1 specifies value of the CENSORVARIABLE_M = variable (default value is 0).

TIMEVARIABLE_M = variable
indicates the individual length of follow-up until death or censoring within each followup time interval (see variable F_TIME in data sets example_death and example_death_2 in Tables 3 and 4 in Section 4.1).

PERIODVARIABLE_M = variable
indicates the cumulative follow-up time by the beginning of each follow-up time interval in input data set for death (see variable F_PERIOD in data sets example_death and example_death_2 in Tables 3 and 4 in Section 4.1). DATASET_D = SAS data set specifies input data set for disease. Must be of the form explained in the previous Section 4.1 (see data set example_disease in Table 5).

CENSORVARIABLE_D = variable
indicates whether censoring before incidence of disease occurred. Must be a binary variable (0/1) (see variable DIAB in data set example_disease in Table 5 in Section 4.1).

TIMEVARIABLE_D = variable
indicates the individual length of follow-up until disease occurrence or censoring (due to death or end of follow-up) within each follow-up time interval (see variable F_TIME in data set example_disease in Table 5 in Section 4.1).

PERIODVARIABLE_D = variable
indicates the cumulative follow-up time by the beginning of each follow-up time interval in input data set for disease (see variable F_PERIOD in data set example_disease in Table 5 in Section 4.1).

IDVARIABLE = variable
identifies each individual (see variable ID in data sets example_death, example_death_2 and example_disease in Tables 3, 4, and 5 in Section 4.1).

COHORTVARIABLE = variable
indicates to which birth cohort the individual belongs (see variable B_COHORT in data sets example_death, example_death_2 and example_disease in Tables 3, 4

GROUPVARIABLE = variable
defines a categorical variable of interest that constitutes the subgroups in which the PAF estimates are calculated separately. If the variable is considered to be a potential effect modifying factor, an interaction term between the variable and the risk factor of interest must be included in the COVARIATE_MODEL statement. If GROUPVARIABLE is omitted, the PAF estimate is computed using the entire sample.

MODIFICATIONS = variable=value
determines the reference category value for the risk factors of interest to which individuals are hypothetically moved in the calculation of PAF. If individuals from only some of the categories of the risk factor are moved to the reference category, then the restriction statement is denoted using the IF-THEN statement of the SAS language. If several risk factors are modified simultaneously, they are separated by slashes. The variables included in the COVARIATE_MODEL statement but not in the MODIFICATIONS statement are treated as confounding factors.

PRINT = YES or NO
defines what is printed out from the LIFEREG procedure: PRINT = YES requests all output results while the PRINT = NO option prints only the relative risks (RR) and their 95% confidence intervals (CI; formed with the help of the PROC LIFEREG output). The default is PRINT = NO.

PAF_M macro
If the outcome of interest is death, input data matrix (DATASET_M), censoring variable (CENSORVARIABLE_M) and censoring value (CENSORVALUE_M) related to death are given in the PAF_M macro call. A more detailed description of the function of this macro is given in the SAS program paf_m.sas. First, the PAF_M macro calls another macro, EST_MATRIX, which prepares the design matrices, vector of parameter estimates and covariance matrix of parameter estimates related to death needed for the calculation of PAF and its confidence interval for total mortality. In addition, the macro produces the relative risks (RR) of death and their 95% confidence intervals for the risk factors of interest given in the COVARIATE_MODEL= option of the PAF_M macro call (a more comprehensive output from the LIFEREG procedure can also be obtained using the PRINT option of the PAF_M macro). Also the convergence status of the LIFEREG analysis is provided. Outputs of PAF_M are listed as follows: the data set DESIGN_1 which indicates to which categories of the baseline hazard variables (follow-up time intervals, birth cohorts, and their interactions) each individual belongs on each follow-up time interval and which values of the risk factors each individual has (matrix Z in (3)).
the data set DESIGN_2 which indicates to which categories of the baseline hazard variables (follow-up time intervals, birth cohorts, and their interactions) each individual belongs on each follow-up time interval and which values of the risk factors each individual has after their hypothetical change to the chosen reference categories indicated in the MODIFICATIONS statement of the PAF_M macro call (matrix Z * ).
the data set &CENSORVARIABLE_M._EST which contains the column vector of parameter estimates for the baseline hazard variables (α M ) and the risk factors (β M ) related to death, where &CENSORVARIABLE_M is substituted with the name of the censoring variable given in the PAF_M macro call (γ M in (3)).
the data set &CENSORVARIABLE_M._COVB which contains the covariance matrix of the parameter estimates for the baseline hazard variables and the risk factors related to death, where &CENSORVARIABLE_M is substituted with the name of the censoring variable given in the PAF_M macro call (Σ M in (7)).
the data set &CENSORVARIABLE_M._RR which contains relative risks of death and their 95% confidence intervals for the risk factors given in the COVARIATE_MODEL statement of the PAF_M macro call, where &CENSORVARIABLE_M is substituted with the name of the censoring variable given in the PAF_M macro call.
the data set &CENSORVARIABLE_M._CONV which indicates the convergence status of the model estimation of PROC LIFEREG.
The function of this macro is described in more detail in est_matrix.sas. Second, the PAF_M macro calls a macro EST_PAF_M which calculates the point estimate, standard error and 95% confidence interval of PAF for total mortality using the formulas described in Section 3.3. A more detailed description of these calculations in the SAS/IML language is provided in est_paf_m.sas. Finally, the PAF_M macro prints out both the relative risks of death and their 95% confidence intervals for the risk factors specified in the COVARIATE_MODEL= option of the macro call as well as the piecewise and cumulative point estimates, standard errors and 95% confidence intervals of PAF for total mortality. If there is a problem in the convergence of the model chosen, an error message is generated.

PAF_D macro
If the outcome of interest is disease, input data matrices (DATASET_M, DATASET_D), censoring variables (CENSORVARIABLE_M, CENSORVARIABLE_D) and censoring values (CENSORVALUE_M, CENSORVALUE_D) related to both death and disease are given in the PAF_D macro call. In this case, the follow-up time must be divided into time intervals of the same length (TIMEVARIABLE_M, TIMEVARIABLE_D) and birth cohorts (B_COHORT) must be of the same length in both input data matrices (see SAS data sets example_death_2 and example_disease in Tables 4 and 5 in Section 4.1). Furthermore, the variables specified in the IDVARIABLE=, CLASSVARIABLES=, and COVARIATE_MODEL= options must be identical in both input data matrices. A more detailed description of the function of this macro is given in paf_d.sas. First, the PAF_D macro calls the macro EST_MATRIX separately for death and disease to prepare the design matrices (Z and Z * ), vectors of parameter estimates (γ M andγ D ) and their covariance matrices (Σ M andΣ D ) in (3), (4) and (10) needed for the calculation of PAF and its confidence interval for disease incidence (see est_matrix.sas). Second, the PAF_D macro calls the macro EST_PAF_D which using these outputs as its inputs calculates the point estimate, standard error and 95% confidence interval of PAF for disease incidence using the formulas described in Section 3.4. A more detailed description of these calculations in the SAS/IML language is provided in est_paf_d.sas. Finally, the PAF_D macro prints out both the relative risks of both death and incidence of disease and their 95% confidence intervals for the risk factors specified in the COVARIATE_MODEL= option of the macro call as well as the point estimate, standard error and 95% confidence interval of PAF for disease incidence. If there is a problem in the convergence of the model chosen, an error message is delivered.
If the GROUPVARIABLE is given in the PAF_M or PAF_D macro call, the point estimate, standard error and 95% confidence interval of PAF either for total mortality or disease incidence are calculated separately in the subgroups defined by this variable. To do this the design matrices DESIGN_1 and DESIGN_2 prepared by the EST_MATRIX macro are divided into separate design matrices according to the categories of the GROUPVARIABLE. Then, depending on the outcome of interest, either the PAF_M macro calls the macro EST_PAF_M (death) or the PAF_D macro calls the macro EST_PAF_D (disease) as many times as there are categories of the GROUPVARIABLE to calculate the separate point estimates, standard errors and 95% confidence intervals of PAF (see paf_m.sas and paf_d.sas for a more detailed description). After this, either the PAF_M macro calls the macro EST_PAF_DIFF_M (death) or the PAF_D macro calls the macro EST_PAF_DIFF_D (disease) which calculates the differences between these PAF estimates, their 95% confidence intervals as well as a p value to determine the statistical significance of these differences using the formulas described in Section 3.5. More detailed descriptions of these calculations in the SAS/IML language are given in est_paf_diff_m.sas and est_paf_diff_d.sas. Finally, the PAF_M and PAF_D macros print out the groupwise point estimates, their standard errors and 95% confidence intervals of PAF either for total mortality or disease incidence as well as the differences between these groupwise PAF estimates, their standard errors and 95% confidence intervals, and a p value for the statistical significance of these differences.

Data example
This data example demonstrates the importance of certain modifiable risk factors, alone and in interaction, on type 2 diabetes incidence through calculation of PAF for disease incidence using the SAS modules presented in Section 4 previously. It is based on data from the Mini-Finland Health Survey cohort study carried out in 1978-1980(Aromaa, Heliövaara, Impivaara, Knekt, and Maatela 1989. Altogether 4,517 men and women aged 40-79, who participated in a health examination and were free of type 2 diabetes and cardiovascular diseases at baseline, were included in this study. Their height and weight were measured at the health examination, and their body mass index (BMI) was calculated. Casual blood pressure was measured and hypertensive medication self-reported. Also smoking was self-reported. The follow-up time was defined as days from the baseline examination to the date of type 2 diabetes occurrence, death, or end of follow-up, whichever came first. During a 17-year follow-up, a total of 227 individuals developed type 2 diabetes. The categorisation of the risk factors of interest (BMI, blood pressure, and smoking) and the distribution of the individuals as well as the type 2 diabetes cases across these categories is presented in Table 8.
The original data matrix, in which the rows correspond to individuals and the columns to information related to them (such as the total follow-up time), is shown in example. In this example, the follow-up time is divided into 5-year intervals. The original data matrix is then modified into two separate input data matrices (example_death_2 and example_disease), in which there is one row for each follow-up time interval for each individual, by making the following GEN_DATA macro calls (see data preparation in Section 4. After the preparation of input data matrices, the PAF_D macro for the analysis of PAF for disease incidence described in paf_d.sas is called. To calculate PAF, for example, for smoking (SMOKE) so that the current smokers (categories 3 and 4 in Table 8) are hypothetically moved to the reference category of never smokers (category 1 in Table 8 where X is the path to the current working directory. Note that the results are sex-and age-adjusted as sex is included in the model (variable SEX in the CLASSVARIABLES= and COVARIATE_MODEL= option), and age is taken into account through birth cohort (variable B_COHORT in the COHORTVARIABLE= option), according to which the baseline hazard of the piecewise constant hazards model is stratified. In this example, 20-year birth cohorts are used (see example_death_2 and example_disease) since there are no diabetes cases among the youngest and the oldest, and thus shorter birth cohort intervals of, for example, 10 years would lead to zero cells and make the convergence of the model questionable. The results of this analysis are shown in Table 6 (and are also presented in Table 8): Note that although the relative risk (RR) of type 2 diabetes incidence for the individuals belonging to the highest smoking category (category 4: ≥ 30 cigarettes/day) is quite high (RR = 2.79, 95% CI = 1.25, 6.20) (Table 6b), the PAF for the hypothetical modification of the individuals, belonging to this category or to the previous category of the second highest smoking, to the reference category is very low (PAF = 0.02, 95% CI = -0.05, 0.09) (Table 6c). This is due to the low prevalence (1.7%) of the individuals belonging to the highest smoking (a) Relative risks of death and convergence status of the estimation algorithm.
(b) Relative risks of incidence of disease and convergence status of the estimation algorithm.
(c) PAF for disease incidence. category (Table 8), as the PAF measure accounts for both the strength of the association between the risk factor and the outcome (RR) and the prevalence of the risk factor. The corresponding PAF_D macro calls for calculating PAF for BMI and blood pressure are demonstrated in paf_example_d and the results related to them in Table 8. Here, the PAF estimates for the hypothetical modification of BMI (BMI_2) or blood pressure (BP), from category with the higher risk (category 2) to reference category (category 1), are much higher (PAF = 0.68, 95% CI = 0.55, 0.77, and PAF = 0.31, 95% CI = −0.06, 0.55, respectively) than that for smoking as the prevalence of the individuals belonging to the modified categories is much higher (59.9% and 85.5%, respectively).
If we have reason to believe that some factors (e.g., blood pressure BP) modify the relationship between a risk factor, such as BMI (BMI_2), and the outcome, we can analyze this by comparing the PAF results obtained in the subgroups defined by the categories of this potential (a) Relative risks of death and convergence status of the estimation algorithm.
(b) Relative risks of incidence of disease and convergence status of the estimation algorithm.
(c) PAF for disease incidence.
(d) PAF for disease incidence in subgroups of BP.  where X is the path to the current working directory. Thus, an interaction term between the risk factor and potential effect modifying factor must be included in the COVARIATE_MODEL statement and the potential effect modifying factor in the GROUPVARIABLE statement. The results of this analysis are shown in Table 7 (and are also presented in Table 8): In this case, the effect modification by blood pressure turned out to be statistically significant (p value = 0.02) (Table 7d) and the PAF for the hypothetical modification of BMI was much higher for those with elevated blood pressure (PAF = 0.73, 95% CI = 0.60, 0.82) than for those with normal blood pressure (PAF = 0.25, 95% CI = −0.27, 0.55) ( Table 7c).
The estimation and statistical inference of total mortality PAF with the GROUPVARIABLE = BP option are carried out in a similar fashion to the other examples of PAF for total mortality in paf_example_m.sas. The only difference is that in case of the estimation of PAF for total mortality both piecewise and cumulative PAF estimates and their standard errors and 95% confidence intervals are produced.

Discussion
The methods for the estimation of PAF in a cohort study design, taking adequately into account the follow-up time, have been developed during recent years (Chen et al. 2006;Samuelsen and Eide 2008;Cox et al. 2009;Laaksonen et al. 2010a,b). As far as these authors know, no publicly available program to implement these methods has, however, yet been provided, apparently hindering their wider application. The SAS macros based on the recently developed methods (Laaksonen et al. 2010a,b) presented in this paper close the gap between theory and application. Using these macros it is now possible to estimate PAF and its confidence interval in a cohort study design both for total mortality (Laaksonen et al. 2010b) and for disease incidence (Laaksonen et al. 2010a), taking into account the different sources of censoring. The PAF macros are very flexible in that both categorical and continuous risk factors and confounding factors as well as their interactions can be included in the model, as long as the estimation algorithm still converges. In addition, the estimation of PAF in the presence of potential effect modification and analysis of its statistical significance are possible. However, as both the prevalence of the risk factors and the strength of the association between the risk factors and the outcome affect PAF, both of these components should be taken into account when analyzing the potential effect modification to make sure that it is due to the difference in strength of the association between the risk factor and outcome and not just due to different prevalences. Different prevalences can also prevent a difference in strength of association from becoming significant in the analysis of effect modification. Furthermore, although in this paper it was assumed that the parameter estimates and PAF estimates were calculated based on the same data, it is also possible to calculate the parameter estimates from external data and apply them to the data of interest by using the macros: for the total mortality case, EST_MATRIX and EST_PAF_M; for the disease case, EST_MATRIX and EST_PAF_D.
The time-to-event data is modeled based on a proportional hazards model with a piecewise constant baseline hazard function. The baseline hazard in the piecewise constant hazard model can be stratified with respect to both follow-up time and birth cohort. With a judicious choice of the cut-points in the piecewise constant hazard model almost any baseline hazard can be well approximated in large data sets. It should be noticed, however, that it is the user's responsibility to ensure that the choice of the cut-points for follow-up time intervals and birth cohorts results in at least one case at each time interval and birth cohort to guarantee a reliable estimation of PAF. The macro provides information on the convergence status of the model chosen. In addition to the number of the cut-points, the computation time of the macro depends also on the number and type of variables included in the model. In general, however, the computation time is very fast, even with quite closely-spaced cut-points, partly due to the analytic variance estimation.