Estimating the standardized incidence ratio (SIR) with incomplete follow-up data

Background A standard parameter to compare the disease incidence of a cohort relative to the population is the standardized incidence ratio (SIR). For statistical inference is commonly assumed that the denominator, the expected number of cases, is fixed. If a disease registry is available, incident cases can sometimes be identified by linkage with the registry, however, registries may not contain information on migration or death from other causes. A complete follow-up with a population registry may not be possible. In that case, end-of-follow-up date and therefore, exact person-years of observation are unknown. Methods We have developed a method to estimate the observation times and to derive the expected number of cases using population data on mortality and migration rates. We investigate the impact of the underlying assumptions with a sensitivity analysis. Results The method provides a useful estimate of the SIR. We illustrate the method with a numerical example, a simulation study and with a study on standardized cancer incidence ratios in a cohort of migrants relative to the German population. We show that the additional variance induced by the estimation method is small, so that standard methods for inference can be applied. Conclusions Estimation of the observation time is possible for cohort studies with incomplete follow-up. Electronic supplementary material The online version of this article (doi:10.1186/s12874-017-0335-3) contains supplementary material, which is available to authorized users.


Background
The analysis of epidemiologic cohort studies usually requires person-years (py) calculation to measure the exact time at risk. Person-years are not only used to compute different rates within a cohort resulting in e.g. rate ratios (RR), but also to calculate the commonly applied standardized incidence or mortality ratio (SIR, SMR) to compare the observed number of certain events with the expected number given the rates in the underlying population [1]. A SIR or a SMR estimates the occurrence of an event in a population relative to what might be expected if the population had the same experience as some larger comparison population designated as 'normal' or average or reference.
To be able to calculate the exact accumulated number of person-years within a cohort, the dates (i) begin of follow-up and (ii) end of follow-up must be known for each individual. For estimating the SIR, the date of disease onset and reference incidence rates for comparison (to calculate the expected number of cases) must also be available. The latter may come from disease registries, e.g. a cancer registry. The end of the observation period is defined as the minimum of date of death, date of disease onset and end-of-follow-up/lost-to-follow-up date.
Statistical inference for the SIR and the SMR is usually based on the assumption that the denominator is fixed and that the numerator is Poisson distributed. Both approximate and exact solutions have been developed [1]. The analysis is implemented in common software programs like SAS, R or STATA. Few authors, however, have considered the case when the denominator must be considered as random. Silcocks showed that the beta distribution can be used for statistical inference when the expected number of events is subject to random variation [2]. More recently, Beyene and Moineddin compare methods for measuring the "location quotient", which is the relative contribution of a rate in a subset (size n i ) to the whole population (size n) [3]. In both of these cases the variance of the denominator cannot be neglected because the underlying population is large enough relative to the study population.
In the present case, a different situation is considered which has not been paid much attention. While the linkage of a cohort with a disease registry may be straight forward, the linkage with the population registry may be hampered by bureaucracy, data protection issues, costs, or other reason. In that case the exact person-years are unknown. This was the case for a cohort study conducted by us and motivated the present work.
In this paper we present a method to estimate the SIR and to perform statistical inference under the following situation: (i) The cohort is fully defined and information on basic covariates (date of birth, date of entry into cohort, sex, covariates where appropriate) is complete (ii) incident cases of the disease(s) of interest have been identified through linkage with a disease registry, with date of disease onset known. This registry may cover the geographical area from which the cohort originated, but not necessarily the whole country (iii) a mortality follow-up is not available, i.e. it is unknown if a person died of a disease other than the disease of interest, and it is unknown whether and when a person moved out of the catchment area of the disease registry.
This situation is not uncommon. In the second section, we provide the estimate, a procedure to implement this in SAS, an estimate for the variance, and give recommendations for a sensitivity analysis. In the third section, we illustrate the procedure with a simple artificial example.
In the fourth section we investigate properties of the estimation procedure with a simulation study. In the fifth section, we present data from a cohort study in which this situation occurred [4]. It is a study on immigrants from the Former Soviet Union in a German state (Saarland) where we calculated the SIR for cancer, based on linkage with the Saarland cancer registry. This study was performed with 18 621 migrants from the Former Soviet Union who immigrated between 1990 and 2005 to the German federal state of Saarland. For every individual, name, sex, date of birth, date of immigration (start of the observation period), and first city of residence was available. A complete incidence follow-up was done by linking the cohort with the Saarland cancer registry which is a high quality register for this federal state [5]. 470 incident cancer (except non-melanoma skin cancer) cases were identified. Mortality follow-up and tracing of individuals who relocated from the state, however, was not performed. We present a sensitivity analysis to assess the effect of the assumptions made in the method. Finally, we discuss the relevance of our findings.

Outline and notations
We consider data of a cohort with N individuals numbered as n = 1, …, N where date of birth, date of beginning of follow-up, sex and possibly other covariates are known. We assume further that a disease registry exists which covers the geographical area from which the cohort was selected. Linkage with this disease registry yields the date of diagnosis for those diseased before a fixed follow-up date. Then, for each individual the date of diagnosis for those diagnosed within the catchment area of the registry (in the following: study area) or no information on the end of follow-up is given. Thus, four events are possible for each individual as given in Fig. 1 where py ij are the observed person-years in the full cohort for age group i and calendar-period j, and λ ij the Fig. 1 Illustration of possible events during a follow-up procedure for individuals entering the cohort at variable dates corresponding incidence rates for disease D in the reference population. For the estimation according to the later proposed method we need the mortality rates for all other diseases except D, and the out-migration rates. Both may be age-and period-dependent, and we denote these as μ ij and ν ij , respectively.
To introduce the method for person-years estimation, we also need the estimated person-year contribution of an individual n for year 1, 2 … . We denote this as c py i1n , c py i2n etc. Year 1 denotes the first calendar year in which cohort members contribute observation time.

Person-years estimation with incomplete follow-up for a single individual
In order to estimate py ij from the available data, we assume that mortality rates for causes other than the disease of interest μ ij which can be applied to the cohort (by age and calendar period) (event 2) and the migration rates ν ij (event 3) are available. Note that the following applies to the cohort members who did not become incident cases during the follow-up. For these incident cases, person-times are exactly known and can be assigned to age groups and calendar periods as usual.
For sake of simpler presentation we assume for a moment that follow-up start and birthday is January 1 st . We assume further for a moment that age groups and calendar periods of one year are considered, that the expected person-year contribution for an individual n for which one of the events occurred in a year is 0.5 years, and that the events 1,2,3 are independent. Then, the estimated person-year contribution for individual n which is not an incident case belonging to age i at calendar year 1 after entry into the study, c py i1n , is We can assume small rates, so γ i1 can be approximated by γ i1 = μ i1 + ν i1 . So if, for example, μ i1 = 0.002 and ν i1 = 0.013, then γ i1 = 0.002 + 0.013 − 0.002 × 0.013 = 0.015 − 0.000026 = 0.014974 ≈ 0.015, and an estimate for the person-year contribution in the first year is (1-0.015) + 0.5 × 0.015 = 0.9925 person-years instead of a full personyear. In other words, from 1000 individuals in that age group who did not get the disease of interest, an expected number of 15 individuals either die from another disease or migrate within a year. Each of these 15 individuals contribute an expected half person-year (since a uniform distribution of the time of occurrence of the events within a year can be assumed), so the expected total number of person-years for these 1000 individuals is 985 + 15x0.5 = 992.5 person-years, i.e. 0.9925 per person. For year 2 calculation is similar: From 1000 individuals in that age group who did not get the disease of interest, an expected number of 15 either died from another disease or migrated in the first year and therefore do not contribute person-time for the second year.
This procedure continues for an arbitrary year k until the end-of-follow-up-date as This can be rewritten as Since unless the rates change strongly from one year to the next, we use the simpler equation A full numerical example is given in Table 1. In a cohort individual entry dates and birth dates are arbitrary,, and the calendar period changes over time and so do the corresponding rates μ ij and ν ij .. We now use the common notation for person-years which are given according to age groups. In praxis, 5-year age groups i = 1, .., I are commonly used. We use this grouping in the following. The expected person-time that an individual n contributes to the age class i × calendar year j, j = 1, …, J , c py ijn is then given as the probability that the class is entered multiplied by the expected observation time within that class. The probability that the class is entered is the product of the failure probabilities in all previous age × calendar year classes (i' ,j'), i.e. Y is the sum of the mortality and migration rate in class (i,j) and py i ' j ' n * is the maximal person-time (raw person-years) in that class (which is 1 year under this categorization), and thus we get Equation (6) is the general form of Eq. (3) which allows arbitrary entry times, birth dates, and class length. Summing up the expected person-years for each cohort member over all age-groups and calendar years then gives an estimate for the expected person-years in all categories, c py ij ¼ py ijn . The person-years from the cases py ij,cases have to be added. An estimate for the SIR is then given by Computational aspects In principle, the person-years contributions for each individual for each calendar year and age class can be calculated separately, and thus the estimate in Eq. (6) can be used directly. Computationally, it is simpler to first calculate the person-years by age group and calendar period under the assumption μ ij = ν ij = 0, i.e. for all individuals not diseased the study endpoint is used as end of follow-up date for all individuals. Common software to calculate person-years is applied here and gives the 'raw' person-years which we have denoted as py ij * . This also is the upper bound for the total person-years and can be used to calculate the minimal SIR. We apply the above similar as in Eq. (5). Here i is the index for age group and 1 the first calendar year to which individual(s) contribute person-times. If the population did not undergo aging, the adjusted person-years py in the second calendar year would be b since to the raw py in year 2 contribute individuals who enter the cohort in year 2 and individuals who entered in year 1. For the latter group, the expected loss of py in the first year, py Ã i1 − b py i1 , must be taken into account and subtracted from the raw py in year 2. On this adjusted term, , the rates for the year 2 (1 − γ i2 ) must be applied. This procedure continues until the last calendar year is reached for which observation times are recorded.
In the iterative process we must take aging into account, i.e. people move through the underlying age groups. Since mortality rates are commonly provided in five year age groups we use this categorization as well. For a given year, the observation times of a five-year age group undergo aging such in expectation that one fifth will move to the next oldest five-year age group in the next year. The estimated person-years for the second calendar year j2 of age group i, b py i2 is therefore the raw person-years, py i2 * , minus 4/5 of the expected loss in the same age group, minus 1/5 of the expected loss in the next youngest age group, yielding Finally, for an arbitrary year j we get We have written an SAS Macro to perform the required calculations (available as Additional file 1). As mentioned before, the exact solution requires the procedure for the non-diseased only and adding the (known) person-years from the diseased (the cases).

Variance estimation and confidence intervals
Standard procedures for SIR confidence intervals consider the observed number of cases O as random variable which is Poisson distributed and the denominator as fixed. Our procedure to estimate the accumulated person-years adds a variance component to the expected number of deaths E, and therefore an additional variance component must be taken into account. In the following we show that this component is so small compared to the component in the numerator that it can be neglected.
The delta-method asymptotically gives V ar O The evaluation of V ar c py ij Þ is done under the same assumptions on entry times and ages as before. For the first year (1) and for a single individual n, the variance of its person-year contribution is V ar pŷ i1n ð Þ¼ This is shown as follows: Let Y denote the random variable denoting the (unknown) person-time for an individual in the first year of observation. Its density , assuming a uniform distribution given an event (migration or death from other cause) occurred. Then we get expectation We are interested in the variance of Y. We use the relation and then after basic probability theory E(Z) = (1 − γ i1 ) + γ i1 /3 and thus Var(Y) ). For small rates, this is smaller than the binomial variance γ i1 (1 − γ i1 )by a factor of about 3. This follows intuitively since individuals with an event contribute between zero and one person-years in the year of the event, not zero. Then, for all individuals in this year and age For the second year, we have for a single individual V ar In order to show that the variance component from the estimated personyears is negligible it is sufficient to consider an upper bound of the total variance of the expected number of cases, E. and for all individuals in the second year and age group i, we have as an upper bound V ar pŷ i2 ð Þ ¼ pŷ i2 ð Þ 1 þ 0: Summing up, we have as an upper bound ! since λ ij py iĵ ¼ E ij ,the expected number of cases in age group × calendar period (i, j), we write Therefore under the rare disease assumption λ ij and γ ij small, the variance of the denominator is several magnitudes smaller than the variance of the numerator. Additional aspects on variance estimation can be found in Additional file 2. The appropriateness of this method has additionally been checked by simulation using the numerical example below. The evaluation shows that the value resulting from Eq. (13) is very low, and that the variance is very small compared to the variance of the numerator. Therefore, it can be neglected and standard methods to calculate the confidence interval can be used.

Sensitivity analysis
The method developed provides an estimate for the person-years and hence for the SIR if the mortality rates and the migration rates are applicable to the cohort. Since the question may arise whether the mortality rates from the underlying population are applicable to the cohort and whether the migration rates are reliably estimated, the question of bias may be more relevant than the additional variance induced by the person-years estimation. If γ ij is underestimated, the person-years are overestimated, up to the maximum when γ ij is zero. Conversely, if γ ij is overestimated, the person-years are underestimated. We recommend to perform a detailed sensitivity analysis by estimating the SIR with reasonably modified migration and mortality rates (see example below). As will be seen in the study example (section 4), the SIR estimate is relatively robust against violations of the assumptions. If migration rates are reliably estimated, however subject to random variation, an additional variance component can easily be incorporated (see numerical example below).

A numerical example
A numerical example is given in Table 1 to show the effect of the estimation procedure on the person-years allocation in the (calendar year) × (age) cross-classification. We consider one single individual n which entered the cohort on January 1st, 1990. This person did not become an incident case during the follow-up period. Therefore, death from another cause or out-migration out of the catchment area of the registry could have occurred at any date during the follow-up period with death rates μ ij and migration rates ν ij , depending on calendar year and age. We assume different migration and mortality rates in different age groups. According to this example, this person with an unknown follow-up status contributes an expected value of py n1 =0.992 person-years to the first year of follow-up, py n2 =0.978 person-years to the second, and py n15 =0.816 person-years to the last year of follow-up instead of a maximal value of one personyears per year. The expected total number of personyears for this person is 13.487 instead of a maximal possible value of 15 years. These are realistic rates and yield a reduction of person-years of 10.1%. If γ ij are assumed fixed, the Variance of py (n)(1) is approximately 0.5 2 γ ij (1 − γ ij )= 0.004, the variance of py n2 and py n3 is approximately (1 + 0.5 2 )γ ij =0.019, and (2 + 0.5 2 )γ ij =0.035, and so on. The variance of the total person-years estimate for this individual is 1.335. If γ ij are random with variance var(γ ij ) = σ γ 2 , we get var(py n1 )= γ ij (1 − γ ij ) + σ γ 2 . The variance component from the rates is here assumed as σ γ 2 = 0.05 2 . Then the variance of the total person-years estimate for this individual is 1.667, an increase of 22%, however, still so small that it can be neglected.
It would be incorrect to assign to this individual 13.429 person-years from the begin of follow-up since this would wrongly yield to a fixed loss-to-follow-up date in the year 2003, and hence full years of observation for all previous years, and zero person-years for the year 2004.
In Additional file 3 we expand this numerical example by assuming that we have a cohort of 1000 individuals with the same characteristics as the individual above. It shows numerically that the variance component induced by the person-years estimate in the denominator of the SIR is small compared to the variance component in the numerator so that it can be neglected. The sensitivity analysis for this dataset used rates with factors 1.5 and 0.75 yielded a bias in the SIR of only 7%. It shows that the procedure is relatively robust against violations in the rate estimates.

Simulation study
A simulation study has been performed to investigate the performance of the estimation procedure. The setup of the simulation is as follows: A cohort of size N = 1000 was simulated with parameters as in the preceding numerical example. For each dataset, the number of observable events (incident cases) and unobservable losses (deaths from other causes or out-migration) and the corresponding failure time of the events and time of loss was simulated. The exact number of person-years was recorded and compared with the estimated number of person-years according to the above procedure. The main parameter for comparison was the relative difference of observed and exact person-years, i.e. Δ py rel ¼ py exact −py estimated py exact .
According to the chosen parameters we observed between 3 and 23 events (median 11, mean 10.75) and between 145 and 218 losses from deaths or outmigration within the 15 year observation period (median 179, mean 179.7). The simulation results regarding the observed and estimated py as well as the absolute and relative difference and number of events are given in Table 2. We observe a slight underestimation of the true person-years of 0.4% with an empirical 90% confidence interval as (−1.0% -1.9%). We conclude that the procedure is sufficiently accurate. In comparison we would find a mean overestimation of the person-years of 10.7% if all individuals are wrongly assumed to remain under risk.
The simulation program was written in SAS and is listed in the supplemental material.

Application
A cohort study was performed with 18621 migrants from the Former Soviet Union who immigrated between 1990 and 2005 to the German federal state of Saarland. For every individual name, sex, date of birth, date of immigration (start of the observation period), and first city of residence was available. A complete incidence followup was done by linking the cohort with the Saarland cancer registry which is a high quality register for this federal state [5]. 470 incident cancer cases (except nonmelanoma skin cancer) were identified. Mortality followup and tracing of individuals who relocated from the state was not performed.
Age, sex and calendar-year specific "raw" personyears were calculated at the end of the observation period (31 st December 2005) by calendar year and 5year age groups. For individuals with cancer diagnosis, the date of diagnoses was set as the endpoint of their observation time. A summarized result is given in Table 3.
Adjustment of the person-years was done according to Eqs. (1) and (2). The German mortality rates for all causes of death except cancer within the cohort (μ ij ) were used. These were obtained from WHO mortality database [6]. The migration rates (ν ij ) were taken from a partial follow-up and from a similar study on migrants in another federal state [7]. This is considered the most appropriate model. The rates are given in Table 4 (model 1). The cancer incidence rates λ ij were taken from the Saarland cancer registry. Table 5 gives the "raw" and adjusted person-years for the 5-year age groups and calendar periods for the cohort. The overall cancer SIR is 1.11 in males and 1.01 in females. Using the "raw" person-years for calculating E, the SIR would be 0.90 and 0.86. The difference between "raw" and estimated total person-years was 18959.7 person-years, or 11.3%.
The sensitivity analysis is done by estimating the number of person-years with different assumptions on migration rates. Model 2 uses halved and model 3 doubled rates (see Table 4). Results of the estimation procedure in terms of person-years and expected number of cancer cases are shown in Table 5 for selected cancer sites based on Saarland incidence rates [8]. Relatively little variation between the different models is observed. The biggest difference to model 1 for the expected number of cases is seen for model 2 in all female cancers with 22.3 cases. For other cancer sites the maximum differences (found between models 1 and 2) in incident cancer cases are 4.3 and 6.8, respectively. Sensitivity analysis demonstrates that estimated expected number of cancer cases and therefore corresponding SIR are quite robust with regard to the underlying assumptions. SIR for different cancer sites according to different models are also shown in Table 5 with the respective standard 95% confidence intervals. Model 4 gives the lower limit of the SIR estimates using the raw person-years for calculating the expected number of cases. On average, the expected number of cases are about 25% higher in this model.

Discussion
We have presented a method to estimate the standardized incidence ratio in a cohort study for which a linkage with a disease registry has been performed, however, for which a mortality follow-up including tracing of individuals which moved out was not directly possible. This situation is not a rare one: cancer registries exist in many parts of the world, and disease registries for other diseases, such as  diabetes, stroke, and others have been developed, e.g. [9,10]. A straightforward mortality follow-up is possible only in countries with a death register or comparable data resources, such as for example the US, Great Britain and the Scandinavian countries. For many other countries, such as Germany, Switzerland, France, Italy, Spain and others the method presented here is an alternative to the sometimes heavily bureaucratic mortality follow-up procedures.
Epidemiologists use these registries to assess disease incidence, and a linkage of cohorts with these registries is technically straightforward. We showed that under quite robust assumptions it is possible to estimate the accumulated number of person-years, and provided a method to calculate the SIR and its 95% confidence intervals taking into account the estimation procedure.
In our example, we present the analysis of cancer incidence in a cohort by making assumptions on mortality from other causes and on migration. First, the same mortality as the German population within the cohort was taken. Results from a similar cohort show that this assumption may be reasonable [4]. A slight underestimate of the number of person-years is possible, though. The second assumption refers to migration. There are good estimates on migration by age group from other cohort studies in Germany, and also on migrants from the former Soviet Union who migrated to another state in Germany. We think these rates can be applied here, however, some uncertainty will remain. A small subset of the cohort (about 20%) has been followed-up by now, and results support our assumption.
Here, we have considered cancer as the disease of interest which can be regarded as a rare disease in terms of the yearly incidence rates. If the disease is common, then the time from begin of observation until disease onset is shorter, and then for more individuals the raw person-time is the correct person-time. Consequently, the bias when using the raw person-years becomes smaller.
Our sensitivity analyses were reassuring in that respect. SIR estimates showed relatively little variation with modified rate assumptions. Still, a bias in the estimated expected value of cases is possible, and this appears to play a bigger role than the increase of variance by the estimation procedure. We showed that the additional variance induced by variation in the denominator is small relative to the variation in the number of observed cases. This allows using standard method of inference, in particular the application of exact confidence intervals if the observed number of cases is small. The completeness of the registry is another relevant issue. In the case of the cancer registry in the Saarland, the percentage of DCO (death certificate only) cases in the registry is very low, and all those cases within the cohort are likely to have been identified with the linkage procedure.