Test–retest reliability of self-reported diabetes diagnosis in the Norwegian Women and Cancer Study: A population-based longitudinal study (n = 33,919)

Objective: Self-reported information from questionnaires is frequently used in epidemiological studies, but few of these studies provide information on the reproducibility of individual items contained in the questionnaire. We studied the test–retest reliability of self-reported diabetes among 33,919 participants in the Norwegian Women and Cancer Study.
Methods: The test–retest reliability of self-reported type 1 and type 2 diabetes diagnoses was evaluated across three self-administered questionnaires (completed in 1991, 1998, and 2005 by Norwegian Women and Cancer participants) using kappa agreement. The time intervals between test and retest were ~7 and ~14 years. Sensitivity of the kappa agreement for type 1 and type 2 diabetes diagnoses was assessed. Subgroup analysis was performed to assess whether test–retest reliability varies with age, body mass index, physical activity, education, and smoking status.
Results: The kappa agreement for both types of self-reported diabetes diagnoses combined was good (⩾0.65) for all three test–retest studies (1991–1998, 1991–2005, and 1998–2005). The kappa agreement for type 1 diabetes was good (⩾0.73) in the 1991–2005 and the 1998–2005 test–retest studies, and very good (0.83) in the 1991–1998 test–retest study. The kappa agreement for type 2 diabetes was moderate (0.57) in the 1991–2005 test–retest study and good (⩾0.66) in the 1991–1998 and 1998–2005 test–retest studies. The overall kappa agreement in the 1991–1998 test–retest study was stronger than in the 1991–2005 test–retest study and the 1998–2005 test–retest study. There was no clear pattern of inconsistency in the kappa agreements within different strata of age, BMI, physical activity, and smoking. The kappa agreement was strongest among the respondents with 17 or more years of education, while generally it was weaker among the least educated group.
Conclusion: The test–retest reliability of self-reported diabetes diagnosis was acceptable, and there was no clear pattern of inconsistency in the kappa agreement stratified by age, body mass index, physical activity, and smoking. The study suggests that self-reported diabetes diagnosis from middle-aged women enrolled in the Norwegian Women and Cancer Study is reliable.


Introduction
Epidemiological studies often rely on self-reported information, as this keeps the costs of data collection lower than those of clinical studies. 1 However, the validity and reliability of the instruments used for data collection are often not reported. 2 Commonly, the Cohen's kappa coefficient is used to determine inter-rater agreement for disease (or other categorical outcomes) by comparing self-reported information against a gold standard (diagnostic test, medical records, physiological measures, etc.). Previous validation studies of self-reported diabetes diagnosis have indicated that diabetes is reported more accurately than other illnesses or diseases. [3][4][5][6][7][8][9][10] The Cohen's kappa coefficient can also be used to analyze the test-retest reliability of an instrument. Many studies from Norway have used self-reported information from questionnaires as the principal tool, but few of them have provided information on the reproducibility of the individual items and instruments therein. It is important to establish that respondents from different socio-demographic backgrounds and age groups have understood the questions in a similar manner. Test-retest reliability is assessed by measuring the responses of the same study sample to an identical question at two or more points in time. 44 These responses are then compared to establish the reliability of the instrument. The chi-square (χ²) test for independence is not appropriate for assessing test-retest reliability since it does not take into account that the data are paired (i.e. repeated measurements on the same individual).
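For readers less familiar with the statistic, the kappa coefficient for a single yes/no item answered at test and retest can be written out as below. This is the standard textbook definition rather than a formula quoted from the study; the cell labels a, b, c, d and the symbols p_o and p_e are notation assumed here for illustration.

```latex
% 2x2 test-retest table for one yes/no item (notation assumed for illustration):
%   a = "yes" at test and retest      b = "yes" at test, "no" at retest
%   c = "no" at test, "yes" at retest d = "no" at test and retest
\[
  n = a + b + c + d, \qquad
  p_o = \frac{a + d}{n}, \qquad
  p_e = \frac{(a + b)(a + c) + (c + d)(b + d)}{n^{2}},
\]
\[
  \kappa = \frac{p_o - p_e}{1 - p_e}.
\]
```

Here p_o is the observed proportion of identical answers and p_e is the proportion expected by chance from the marginal totals, so kappa equals 1 for perfect agreement and 0 when agreement is no better than chance.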
Previous studies using self-reported data from interviews have studied the test-retest reliability of self-reported diabetes diagnosis, with inconsistent kappa agreements. [45][46][47][48][49][50] Since type 2 diabetes typically affects people aged 40 years and over, 51,52 it is possible to differentiate between the test-retest reliability of self-reported type 1 and type 2 diabetes diagnoses using information on age at diagnosis. No previous study was found that assessed the test-retest reliability for either type 1 or type 2 diabetes separately.
The Norwegian Women and Cancer (NOWAC) Study 53 is a prospective cohort study in which women reported diabetes diagnosis and age at diagnosis in three separate questionnaires. If a woman accurately reported her diabetes diagnosis in one study, she is expected to report the same in a subsequent study. This assumption underlies our test-retest reliability analysis. The aim of this study was to assess the test-retest reliability of self-reported diabetes diagnosis, as well as that of type 1 and type 2 diabetes diagnoses separately. Furthermore, the large sample size permits subgroup analyses and sensitivity analysis. We examined whether test-retest reliability varies with age, body mass index (BMI), physical activity, education, and smoking status.

Study cohort and sampling
The NOWAC Study is a prospective nationwide study which started in 1991, 54,55 and contains data from 170,000 women. Participants were randomly selected from the National Population Register of Norway. The external validity of the study 56 and validity of some measures [57][58][59] have been published elsewhere. NOWAC Study participants are assumed to be representative of the female Norwegian population in the corresponding age groups. 56 The detailed characteristics of the participants are described elsewhere, 56 and the updated information on the NOWAC Study is accessible on its website. 54 Of the 170,000 women enrolled in the NOWAC Study, 33,919 women completed all three questionnaires sent in 1991, 1998, and 2005. The general characteristics of the study sample and the association between BMI and type 2 diabetes in this sample are described elsewhere. 52

Questionnaire and classification
Diabetes. Information on diabetes diagnosis was collected by means of the same question in all three questionnaires (1991, 1998, and 2005): "Have you had any of the following diseases?" The list of options included diabetes. Age at diagnosis was measured with the subsequent question, "If yes, at what age was it first discovered?" For the purposes of this study, only participants who reported having diabetes and provided their age at diagnosis were defined as diabetes cases. If participants reported that they were diagnosed with diabetes either in the same year they gave birth to a child or in the year preceding childbirth, it was assumed that they had gestational diabetes, and they were excluded from the analysis. Final numbers of diabetes cases included in analyses are given in Tables 2-4. Participants with missing values on diabetes diagnosis and age at diagnosis were excluded.
Using the responses to the questions on diabetes and age at diagnosis, variables for any diabetes diagnosis, and separate variables for type 1 and type 2 diabetes, were created. Since type 2 diabetes typically affects people aged 40 years or over, 51,52 we classified only those diagnosed at age 40 years or over as having type 2 diabetes. Women who were diagnosed with diabetes at or before age 39 years were categorized as having type 1 diabetes (excluding those with gestational diabetes). Participants with type 1 and type 2 diabetes were classified separately by the above-mentioned criteria for the 1991 test study, the 1998 test study, the 1998 retest study for comparison against the 1991 test study, the 2005 retest study for comparison against the 1991 test study, and the 2005 retest study for comparison against the 1998 test study.
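The classification rules above can be sketched as a small function. This is an illustrative reading of the text, not the study's actual code: the function name, argument names, and return labels are hypothetical, and the gestational rule is expressed in the mother's age in years rather than calendar years, which is an assumption made to keep the example short.

```python
from typing import Optional, Sequence

TYPE2_MIN_AGE = 40  # cut-off from the text: diagnosis at age 40+ counts as type 2


def classify_diabetes(
    reported_diabetes: Optional[bool],
    age_at_diagnosis: Optional[int],
    ages_at_childbirth: Sequence[int] = (),
) -> Optional[str]:
    """Return 'none', 'gestational', 'type1', 'type2', or None if excluded.

    All names and labels are hypothetical. None marks a participant who is
    excluded because of missing values.
    """
    if reported_diabetes is None:
        return None  # missing answer on the diabetes item
    if not reported_diabetes:
        return "none"
    if age_at_diagnosis is None:
        return None  # diabetes reported without age at diagnosis
    # Assumed gestational: diagnosis in the year of a birth or the year before.
    # Checked first, so it takes precedence over the type 1 / type 2 labels.
    for birth_age in ages_at_childbirth:
        if age_at_diagnosis in (birth_age, birth_age - 1):
            return "gestational"
    return "type2" if age_at_diagnosis >= TYPE2_MIN_AGE else "type1"
```

Participants labelled "gestational" would then be dropped before any kappa is computed, mirroring the exclusion described in the text.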
Diabetes cases in the 1991 and 1998 test studies were defined as those who reported both having diabetes and their age at diagnosis in the corresponding questionnaires. One respondent to the 1998 questionnaire fulfilled the criteria for both gestational diabetes and type 2 diabetes and was finally classified as having gestational diabetes only.
Diabetes in the 1998 retest study (for comparison against the 1991 test study). Diabetes cases in the 1998 retest study, for comparison against the 1991 test study, were defined as those with diabetes from the 1998 test study, provided they reported a date of diagnosis prior to 1992. The same criteria were applied to women with type 1 or type 2 diabetes. One woman in the 1998 retest study fulfilled the criteria for both gestational and type 2 diabetes and was finally classified as having gestational diabetes only.
Covariates. Self-reported information on height and weight from the 1998 study was used to calculate BMI (kg/m²). BMI was categorized into three groups: normal weight (BMI: <25 kg/m²), overweight (BMI: 25-29.9 kg/m²), and obese (BMI: ⩾30 kg/m²). Smoking status was derived from the replies to two questions in the 1998 questionnaire: "Have you ever smoked?" (yes, no) and "Do you smoke on a daily basis at the moment?" (yes, no). Women who answered "no" to the former were categorized as "never smokers." Those who answered "yes" to the former, and "no" to the latter, were categorized as "former smokers," and those who answered "yes" to both questions were categorized as "current smokers." A 10-category scale measured the level of self-reported physical activity in the 1998 questionnaire, the validity of which has been reported previously. 21 Responses to questions about physical activity were used to assign a category of physical activity: low (1-3), medium (4-7), and high (8-10). Education (duration in years) was categorized into four groups: primary/intermediate (0-9), secondary (10-12), university (13-16), and postgraduate and above (17+). Age (years) was categorized into four 5-year groups.
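The covariate categories described above translate directly into a few lookup functions. The sketch below is illustrative only: the function and argument names are hypothetical, and treating a BMI between 29.9 and 30 as overweight is an assumption about how the published cut-offs apply to unrounded values.

```python
def bmi_group(weight_kg: float, height_m: float) -> str:
    """BMI (kg/m^2) from self-reported weight and height, in three groups."""
    bmi = weight_kg / height_m ** 2
    if bmi < 25:
        return "normal weight"
    if bmi < 30:  # assumption: values between 29.9 and 30 count as overweight
        return "overweight"
    return "obese"


def smoking_status(ever_smoked: bool, smokes_daily_now: bool) -> str:
    """Combine the two questionnaire items into never/former/current."""
    if not ever_smoked:
        return "never smoker"
    return "current smoker" if smokes_daily_now else "former smoker"


def activity_level(score: int) -> str:
    """Collapse the 10-category physical activity scale into three levels."""
    if not 1 <= score <= 10:
        raise ValueError("physical activity score must be between 1 and 10")
    if score <= 3:
        return "low"
    if score <= 7:
        return "medium"
    return "high"


def education_group(years: int) -> str:
    """Group years of education into the four categories used in the text."""
    if years <= 9:
        return "primary/intermediate"
    if years <= 12:
        return "secondary"
    if years <= 16:
        return "university"
    return "postgraduate and above"
```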

Statistical analysis
Statistical analysis was performed with SAS version 9.2 and Stata version 13.1. Means (standard deviation (SD)) were estimated for all continuous variables, and the percentage of participants in each category was calculated for all categorical variables (Table 1). Variables for all diabetes diagnoses, as well as for type 1 and type 2 diabetes separately, were constructed, and the kappa agreement for each was calculated for the 1991-1998 test-retest study, the 1991-2005 test-retest study, and the 1998-2005 test-retest study. The kappa coefficient summarizes the total agreement beyond that expected by chance. The 95% confidence intervals (CIs) for the kappa statistic were estimated with an analytical method 60 in Stata. 61
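To make the calculation concrete, the following sketch computes kappa and an approximate 95% CI from the four cells of a test-retest table. It is not the Stata routine used in the study: the standard error is the simple large-sample approximation sqrt(p_o(1 - p_o) / (n(1 - p_e)^2)), chosen here as an assumption for illustration, and the function name and the example counts are hypothetical.

```python
import math
from typing import Tuple


def kappa_with_ci(a: int, b: int, c: int, d: int,
                  z: float = 1.96) -> Tuple[float, float, float]:
    """Cohen's kappa and an approximate CI for a 2x2 test-retest table.

    a: yes at test and retest     b: yes at test, no at retest
    c: no at test, yes at retest  d: no at test and retest
    The standard error is a simple large-sample approximation (an assumption
    of this sketch, not necessarily the method referenced in the text).
    """
    n = a + b + c + d
    p_o = (a + d) / n  # observed agreement
    p_e = ((a + b) * (a + c) + (c + d) * (b + d)) / n ** 2  # chance agreement
    kappa = (p_o - p_e) / (1 - p_e)
    se = math.sqrt(p_o * (1 - p_o) / (n * (1 - p_e) ** 2))
    return kappa, kappa - z * se, kappa + z * se


# Hypothetical counts, not taken from the study tables.
kappa, lower, upper = kappa_with_ci(a=20, b=5, c=10, d=15)
print(f"kappa = {kappa:.2f} (95% CI {lower:.2f} to {upper:.2f})")
```

With a rare outcome such as diabetes, most of the sample falls in cell d, so the estimate is driven largely by the comparatively small counts in a, b, and c.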

Sensitivity analysis
Since self-reported age at diagnosis was used as the only discriminative criterion for distinguishing between type 1 and type 2 diabetes, sensitivity analysis was performed by restricting type 1 diabetes to those with age at diagnosis <35 years and type 2 diabetes to those with age at diagnosis >44 years. Those reporting age at diagnosis 35-44 years were excluded for the purpose of assessing the sensitivity of the kappa agreements (Table 5).
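The stricter rule used in the sensitivity analysis can be sketched as follows. The function name, parameter names, and return labels are hypothetical; only the two cut-offs come from the text.

```python
from typing import Optional


def classify_strict(age_at_diagnosis: int,
                    type1_below: int = 35,
                    type2_above: int = 44) -> Optional[str]:
    """Sensitivity-analysis classification by age at diagnosis.

    Returns 'type1' for ages below 35, 'type2' for ages above 44, and None
    for the 35-44 band, which is excluded from the sensitivity analysis.
    """
    if age_at_diagnosis < type1_below:
        return "type1"
    if age_at_diagnosis > type2_above:
        return "type2"
    return None


# Ages 34 and 45 are kept; 35 through 44 are dropped.
print([classify_strict(age) for age in (34, 35, 44, 45)])
```

Dropping the band around age 40 removes the cases most likely to be misclassified by a single cut-off, at the cost of a smaller number of cases per table.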

Subgroup analysis
Subgroup analysis was performed to assess the consistency of the kappa agreement across strata of the covariates (Tables 6-10).

Ethical approval
The NOWAC Study was approved by the Regional Committee for Medical and Health Research Ethics. All participating women gave written informed consent.

Results
Table 1 presents the general characteristics of the study sample. Among the 33,919 women who participated in the 1991, 1998, and 2005 studies, age ranged from 40 to 59 years in 1998 (mean: 47.7 ± 4.3). The majority (64.6%) of the respondents had normal weight (BMI: <25 kg/m²). In total, 40.3% of the respondents had some university education or more. Most (75.5%) of the respondents were classified as having a medium level of physical activity. In this study sample, 28.2% were classified as current smokers, while 31.2% were classified as former smokers.
(Table 4). Table 5 presents the sensitivity of the kappa agreements when those reporting age at diagnosis less than 35 years were classified as having type 1 diabetes and those reporting age at diagnosis greater than 44 years were classified as having type 2 diabetes. The kappa agreements remained moderate to good for type 1 diabetes, while the kappa agreements for type 2 diabetes were fair to good (Table 5).
Tables 6-10 present the kappa agreement for diabetes stratified by age, BMI, physical activity, education, and smoking status. There was no clear pattern of inconsistency in the kappa agreements within different strata of age, BMI, physical activity, and smoking (Tables 6-8 and 10). However, the analysis stratified by level of education shows that the kappa agreement was strongest among the most educated group (Table 9) in all the test-retest comparisons, while generally it was weaker among the least educated group.

Discussion
In this study, we analyzed the test-retest reliability of self-reported diabetes diagnosis in a large sample of middle-aged women in Norway. We observed that the agreement was good for all diabetes diagnoses combined in all three test-retest studies. The weakest agreement was found in the 1991-2005 test-retest study. This was to be expected, as the time interval between these studies was the longest. These results also suggest that other confounding factors may have affected self-reported diabetes diagnosis in the 1991-1998 or 1998-2005 test-retest studies, as the agreement in these periods was expected to be more similar. The fact that diabetes diagnosis may change over time could have contributed to the decreasing agreement observed between the 1991-1998 test-retest study and the 1991-2005 test-retest study. However, looking at the two types of diabetes separately revealed some differences in the kappa agreement.
One possible reason for the higher kappa agreement among women with type 1 diabetes in our study is that these women may have severe complications sooner 64 than women with type 2 diabetes; this may have contributed to the women's recall of age at diagnosis, resulting in a higher agreement for type 1 diabetes.
Since type 2 diabetes typically affects people 40 years of age and over, 51,52 we classified only women diagnosed at age 40 years and over as having type 2 diabetes. However, it is still possible that women younger than 40 years of age have developed type 2 diabetes. [65][66][67][68][69] In addition, cases identified as having gestational diabetes were excluded from the type 2 diabetes group, although women who had gestational diabetes may develop type 2 diabetes later in life. 70,71 Women diagnosed at age 39 years or less who reported a diabetes diagnosis (excluding gestational diabetes) were categorized as having type 1 diabetes.
Since type 1 diabetes can occur at any age, 72 it is also possible that some of the women classified as having type 2 diabetes in fact had type 1 diabetes. Due to the design and self-reported nature of the study, it was not possible to confirm the exact type(s) of diabetes diagnosis. The results from sensitivity analysis restricting type 1 diabetes cases to those reporting age at diagnosis less than 35 years, and restricting type 2 diabetes to those reporting age at diagnosis more than 44 years, were still acceptable.
This study was larger than previous studies, permitting subgroup analyses. No clear pattern of inconsistency in kappa agreements was observed between different strata of BMI, physical activity, and smoking status. Although no formal test of heterogeneity was performed to assess the statistical difference in kappa agreements across the subgroups, there was a pattern across education groups. The kappa agreement was strongest among the most educated group, while generally it was weaker among the least educated group.
Although the NOWAC cohort is representative of Norwegian women in corresponding age groups, the current sample may not be representative, since it includes only the women who participated in all three waves of the study. Furthermore, the respondents with missing values were excluded. Some research suggests that those who belong to low socio-economic strata and are relatively unhealthy are likely to have a higher proportion of missing values in observational studies. 73 Multiple imputation (MI) was not performed, since the kappa statistic 61 is not supported by the MI software [74][75][76][77] in Stata. Therefore, the possibility of selection bias limits the external validity of this study.
The kappa agreement we report here is not comparable to that of other studies 63,78 due to differences in the proportion of people reporting a certain type of diabetes in different studies, or differences in distribution. We found few studies assessing the test-retest reliability of diabetes diagnosis, and the results of those that were found were not consistent. Most showed very good agreement [45][46][47][48][49]79 between the test and the retest studies, while others showed a good 50 or moderate 80 level of agreement. However, most of the studies we found [46][47][48][49]80 did not report either the significance probability or the CIs. One possible reason for the higher kappa agreement reported in previous studies 45-50 may be the relatively short time interval between the test and retest studies, as compared to the ~7- or ~14-year interval in our study. The shorter time interval between the test and retest studies may have caused respondents in other populations to remember their previous response more easily, resulting in a higher kappa coefficient.
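The dependence of kappa on the proportion of positive answers can be illustrated with two hypothetical tables. The counts below are invented for illustration and are not taken from this or any cited study; the function name is likewise hypothetical. Both tables have 90% of answers identical at test and retest, yet the kappa values differ markedly.

```python
def cohen_kappa(a: int, b: int, c: int, d: int) -> float:
    """Cohen's kappa for a 2x2 table (a, d agree; b, c disagree)."""
    n = a + b + c + d
    p_o = (a + d) / n  # observed agreement
    p_e = ((a + b) * (a + c) + (c + d) * (b + d)) / n ** 2  # chance agreement
    return (p_o - p_e) / (1 - p_e)


# Hypothetical table 1: half of the respondents report the condition.
common = cohen_kappa(a=45, b=5, c=5, d=45)

# Hypothetical table 2: one in ten respondents reports the condition.
rare = cohen_kappa(a=5, b=5, c=5, d=85)

print(f"common condition: kappa = {common:.2f}")  # 0.80
print(f"rare condition:   kappa = {rare:.2f}")    # 0.44
```

Because chance agreement rises when nearly everyone answers "no", the same raw agreement yields a lower kappa for a rare condition, which is one reason kappa values from populations with different proportions of reported diabetes are hard to compare.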
Another key difference between previous studies 45-50 and our study was their use of interviews to collect the information on diabetes diagnosis. As these studies used an interview setting, it is reasonable to assume that the respondent had a chance to ask for questions to be repeated, or for further explanation/clarification, and that the interviewer might have provided it. This may have helped the respondents to understand the question better, and therefore to report more accurately. It is probable that this key difference in the investigation tool increases the kappa agreement for the test-retest reliability of the studies using interviews to collect data. However, a study from Manhattan (New York) 80 reported on the test-retest reliability of diabetes diagnosis using telephone interviews. The retest study was conducted within 30 days of the test study, and the kappa agreement between the test and retest studies was found to be 0.48, which is very low considering the short time interval and the use of interviews to collect data. This shows that a short time interval between the test and the retest study and the use of interviews do not necessarily increase the kappa agreement.
A strength of this study is that it is the first to assess the test-retest reliability of self-reported diabetes diagnosis separately for type 1 and type 2 diabetes. Other strengths of our study include a large cohort size, sensitivity analysis of the estimates by self-reported age at diagnosis, and subgroup analysis within different covariates. This study adds to earlier research by providing the reliability of self-reported diagnosis separately for type 1 and type 2 diabetes.

Conclusion
In conclusion, this study shows that the reliability of self-reported information on diabetes diagnosis from a large prospective cohort study with long time intervals between questionnaires is satisfactory.