Inter-rater reliability of a national acute stroke register

Medical quality registers are useful sources of knowledge about diseases and the health services. However, there are challenges in obtaining valid and reliable data. This study aims to assess the reliability of the data in a national medical quality register. We randomly selected 111 patients having had a stroke in 2012. An experienced stroke nurse completed the Norwegian Stroke Register paper forms for all 111 patients by review of the medical records. We then extracted all registered data on the same patients from the Norwegian Stroke Register and calculated Cohen’s kappa and Gwet’s AC1 with 95 % confidence intervals for 51 nominal variables and Cohen’s quadratic weighted kappa and Gwet’s AC2 for three ordinal variables. For two time variables, we calculated the Intraclass Correlation Coefficient. Substantial to excellent reliability (kappa > 0.60/AC1 > 0.80) was observed for most variables related to past medical history, functional status, stroke subtype and discharge destination. Although excellent reliability was observed for time of stroke onset (ICC 0.93), this variable was hampered by a substantial amount of missing values. Some variables related to treatment and examinations in hospital displayed low levels of agreement. This applies to heart rate monitoring (kappa 0.17/AC1 0.46), swallowing test performed (kappa 0.19/AC1 0.27) and mobilized out of bed within 24 h after admission (kappa 0.04/AC1 −0.11). A majority of the variables in the Norwegian Stroke Register have substantial to excellent reliability. The problem areas seem to be the lack of completeness in the time variable indicating stroke onset and poor reliability in some variables concerning examinations and treatment received in hospital.


Background
In many countries, there has been an increasing interest in medical quality registers as a tool for improving the quality of care and as sources of knowledge about diseases and the health services. Consequently, a number of local, regional and national medical quality registers have been established. There seems to be agreement on the importance of such registers. However, there are challenges in obtaining valid and reliable data.
Few studies have investigated the validity of stroke registers [1], and these typically focus on calculating measures of completeness. Although some studies have investigated the reliability of selected key variables in registers [2,3], we are aware of only one study assessing the reliability of all the variables in a stroke-specific register [4]. Reeves et al. found excellent inter-rater reliability for many variables, but several variables in need of improvement were also identified. These include stroke onset time, stroke team consultation, time of initial brain imaging and discharge destination.
Since 1 January 2012, all Norwegian hospitals have been required by law to report medical data on all hospitalized patients who fulfill the WHO criteria [5] for an acute stroke diagnosis to the Norwegian Stroke Register [6].
In the present study, we assessed the reliability of all the variables in the Norwegian Stroke Register by studying inter-rater reliability in a random sample of 111 patients.

The Norwegian Stroke Register
All hospitalized cases of acute stroke are to be reported to the Norwegian Stroke Register, irrespective of whether the patient was treated in a stroke unit or not. Data are initially registered on paper forms locally at the hospitals by dedicated and trained physicians and nurses, who subsequently enter data into the Stroke Register by use of a web-based form. The registration process at the hospitals is often completed after the patient has been discharged, by look-up in the electronic medical records. As there are one or more registrars at each stroke unit or hospital, depending on the size of the unit, the reporting nurse or physician may or may not have been involved in the treatment of the patient.
The register contains person identifiable data on the patients' functional status before the stroke, past medical history, the use of drugs prior to hospitalization and at discharge, clinical findings on admission to hospital, diagnostic procedures, treatment received during hospitalization and dates and times for stroke onset, admission to hospital and discharge. A user manual provides definitions of the variables and data entries [7].

Data collection
The study population consists of patients hospitalized in one of four hospitals in Central Norway (St. Olav's University Hospital, Levanger Hospital, Kristiansund Hospital and Ålesund Hospital). The four hospitals were chosen as they had reported data to the Stroke Register since 2004, and had by 2012 established well-functioning data collection routines. A total of 1253 patients were registered with an acute stroke diagnosis (ICD-10 codes I61, I63, I64) in the Norwegian Patient Register in the period from 1 April to 31 December 2012 in the four hospitals under study. From these registrations, we selected 120 patients using a random number generator.
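The selection step can be illustrated with a minimal sketch. The source does not say which random number generator or seed was used, so the function name `draw_sample`, the seed value and the 1-based record numbering are all hypothetical; only the population size (1253) and sample size (120) come from the text.

```python
import random


def draw_sample(population_size, sample_size, seed):
    """Draw a simple random sample of record numbers without replacement.

    Record numbers run from 1 to population_size (hypothetical numbering).
    A fixed seed makes the draw reproducible.
    """
    rng = random.Random(seed)
    return sorted(rng.sample(range(1, population_size + 1), sample_size))


# 120 of 1253 registrations, as in the study; the seed is an arbitrary assumption.
selected = draw_sample(1253, 120, seed=2012)
print(len(selected), selected[:5])
```

Sampling without replacement ensures that no registration is selected twice.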
An experienced nurse working at a stroke unit in one of the hospitals under study (St. Olav's University Hospital) filled in the Norwegian Stroke Register paper forms for the 120 patients by a review of the patients' medical records. She did not work at the stroke unit in 2012, and consequently did not perform the original registrations. The nurse was given access to all necessary information, including results of diagnostic tests and examinations and laboratory tests. The data collection was done during May-June 2014. The nurse did not receive any particular training; the only guidance she received was the Norwegian Stroke Register User Manual, which is accessible to all registrars. The reason for this was that we wanted to mimic a "real world" situation as far as possible. The professional background of the nurse and the situation in which the registrations took place were considered typical of the actual registration procedures at hospitals around the country.
When extracting data from the Norwegian Stroke Register, 9 of the 120 cases were not found. Consequently, the sample size was reduced to 111 patients.

Statistical analysis
The sample size was determined on the basis of recommended sample size calculations for the kappa statistic. The goodness-of-fit approach by Donner and Eliasziw [8] indicates that, based on alpha and beta error rates of 0.05 and 0.2, respectively, when testing for a statistical difference between moderate (0.40) and excellent (0.90) kappa values, sample size estimates range from 13 to 66. Our sample of 111 patients thus seems appropriate for obtaining generalizable estimates of inter-rater reliability.
For all nominal variables the inter-rater agreement is presented in terms of observed agreement, Cohen's kappa and Gwet's AC1 with 95 % confidence intervals (CI) [9][10][11]. Cases with missing values were excluded. For the three ordinal variables we used the quadratic weighted kappa and AC2, and the category "unknown" was excluded.
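As an illustration of the quadratic weighted kappa used for the ordinal variables, the following minimal sketch computes it from a cross table of two raters. The study itself used AgreeStat; the function name `quadratic_weighted_kappa` and the 3 × 3 table are hypothetical and are not taken from the register data.

```python
def quadratic_weighted_kappa(table):
    """Cohen's weighted kappa with quadratic weights for two raters.

    table[i][j] = number of subjects placed in ordinal category i by
    rater A and category j by rater B (categories in rank order).
    """
    q = len(table)
    n = sum(sum(row) for row in table)
    row_marg = [sum(table[i]) / n for i in range(q)]
    col_marg = [sum(table[i][j] for i in range(q)) / n for j in range(q)]

    def weight(i, j):
        # Full credit on the diagonal, decreasing with squared distance.
        return 1 - (i - j) ** 2 / (q - 1) ** 2

    p_obs = sum(weight(i, j) * table[i][j] / n
                for i in range(q) for j in range(q))
    p_exp = sum(weight(i, j) * row_marg[i] * col_marg[j]
                for i in range(q) for j in range(q))
    return (p_obs - p_exp) / (1 - p_exp)


# Hypothetical cross table for a three-level ordinal variable (n = 100).
example = [[20, 5, 0],
           [4, 30, 6],
           [1, 4, 30]]
kappa_w = quadratic_weighted_kappa(example)
print(round(kappa_w, 3))  # about 0.806
```

With quadratic weights, disagreements between adjacent categories are penalized less than disagreements between distant categories, which is why the weighted coefficient here is higher than the unweighted kappa for the same table would be.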
The kappa statistic is influenced by the trait prevalence and rater bias [12,13]. In situations where a large proportion of the ratings are either positive or negative, the unbalanced prevalence of the trait will lead to a reduced kappa coefficient. In situations where there is a systematic difference between the two raters' tendencies to make particular ratings, the kappa coefficient may be inflated. Gwet's AC1 and AC2, however, are not affected by trait prevalence or rater bias [11].
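The prevalence effect described above can be made concrete with a small sketch that computes observed agreement, Cohen's kappa and Gwet's AC1 from the same cross table. The function name `agreement_coefficients` and the skewed 2 × 2 table are hypothetical illustrations, not data from the register, and confidence intervals are omitted.

```python
def agreement_coefficients(table):
    """Return (observed agreement, Cohen's kappa, Gwet's AC1) for two raters.

    table[i][j] = number of subjects rated category i by rater A and
    category j by rater B.
    """
    q = len(table)
    n = sum(sum(row) for row in table)
    p_obs = sum(table[k][k] for k in range(q)) / n
    row_marg = [sum(table[k]) / n for k in range(q)]
    col_marg = [sum(table[i][k] for i in range(q)) / n for k in range(q)]

    # Cohen: chance agreement is the product of the raters' marginals.
    pe_kappa = sum(row_marg[k] * col_marg[k] for k in range(q))
    # Gwet: chance agreement is based on the average marginal per category.
    pi = [(row_marg[k] + col_marg[k]) / 2 for k in range(q)]
    pe_ac1 = sum(p * (1 - p) for p in pi) / (q - 1)

    kappa = (p_obs - pe_kappa) / (1 - pe_kappa)
    ac1 = (p_obs - pe_ac1) / (1 - pe_ac1)
    return p_obs, kappa, ac1


# Hypothetical skewed table: 94 % of ratings fall in the first category.
skewed = [[90, 4],
          [4, 2]]
p_obs, kappa, ac1 = agreement_coefficients(skewed)
print(round(p_obs, 2), round(kappa, 2), round(ac1, 2))  # 0.92 0.29 0.91
```

In this example the raters agree on 92 % of cases, yet kappa is only about 0.29 because the expected chance agreement under Cohen's model is already 0.89, whereas AC1 stays close to the observed agreement.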
When interpreting chance-corrected agreement, we use the criteria suggested by Landis and Koch [14], stating that a value between 0 and 0.20 implies slight agreement, 0.21-0.40 fair, 0.41-0.60 moderate, 0.61-0.80 substantial, and >0.80 excellent agreement. To aid interpretation of the agreement coefficients, cross tables for all presented variables are included in Additional file 1: Appendix S1.
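The interpretation bands can be written as a simple lookup. This is only a sketch: the function name `agreement_label` is hypothetical, and the label for values below 0, which the criteria quoted above do not cover, is an assumption.

```python
def agreement_label(coefficient):
    """Map a chance-corrected agreement coefficient to the bands quoted above."""
    if coefficient < 0:
        # Assumption: the quoted criteria start at 0; negative values are
        # labeled here as agreement no better than chance.
        return "no better than chance"
    if coefficient <= 0.20:
        return "slight"
    if coefficient <= 0.40:
        return "fair"
    if coefficient <= 0.60:
        return "moderate"
    if coefficient <= 0.80:
        return "substantial"
    return "excellent"


for value in (0.17, 0.54, 0.69, 0.97):
    print(value, agreement_label(value))
```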
Time variables were re-calculated as minutes past midnight when the corresponding date variable was the same for both raters (nurse and Stroke Register). We calculated the Intraclass Correlation Coefficient (ICC) using a two-way random effects consistency ANOVA model. ICC assesses agreement by comparing the variability of different ratings of the same subject to the total variation across all subjects [15]. The interpretation of the magnitude of ICC is similar to that of kappa; a coefficient of 0 means no agreement and 1 means full agreement. In addition to the ICC estimates, we calculated the mean and standard deviation of the differences between the raters.
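The time-variable analysis can be sketched as follows. The study used SPSS; this illustration assumes the single-measures form of the two-way consistency ICC, which the text does not specify, and the helper names and clock times are hypothetical.

```python
import statistics


def to_minutes(hhmm):
    """Convert a clock time 'HH:MM' to minutes past midnight."""
    hours, minutes = hhmm.split(":")
    return int(hours) * 60 + int(minutes)


def icc_consistency(ratings):
    """Two-way consistency ICC, single measures (assumed form).

    ratings: one list per subject, holding one value per rater.
    """
    n, k = len(ratings), len(ratings[0])
    grand = sum(sum(r) for r in ratings) / (n * k)
    row_means = [sum(r) / k for r in ratings]
    col_means = [sum(r[j] for r in ratings) / n for j in range(k)]

    ss_total = sum((x - grand) ** 2 for r in ratings for x in r)
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)   # between subjects
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)   # between raters
    ss_err = ss_total - ss_rows - ss_cols                    # residual

    ms_rows = ss_rows / (n - 1)
    ms_err = ss_err / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (ms_rows + (k - 1) * ms_err)


# Hypothetical times: the second rater is always exactly 10 minutes later.
nurse = ["08:30", "13:15", "22:05", "06:45", "17:20"]
register = ["08:40", "13:25", "22:15", "06:55", "17:30"]
pairs = [[to_minutes(a), to_minutes(b)] for a, b in zip(nurse, register)]

icc = icc_consistency(pairs)
diffs = [b - a for a, b in pairs]
mean_diff = statistics.mean(diffs)
sd_diff = statistics.stdev(diffs)
print(round(icc, 3), mean_diff, sd_diff)  # 1.0 10 0.0
```

A consistency-type ICC ignores a constant offset between raters, so it equals 1 here even though every pair differs by 10 minutes. This is one reason to report the mean and standard deviation of the differences alongside the ICC.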
When reporting results, we have strived to fulfil the recommendations in the Guidelines for Reporting Reliability and Agreement Studies by Kottner et al. [16]. We used IBM SPSS Statistics 21 for building analysis files and for calculating ICC, and AgreeStat 2015.4 for calculating kappa, AC1 and AC2 statistics.
The study was approved by the Norwegian Directorate of Health and The Norwegian Data Protection Authority.

Results
The sample of 111 patients consisted of 53.2 % men, with mean age 73.1 years (SD 14.9). In comparison, the total population in the Norwegian Stroke Register in 2012 consisted of 51.7 % men with mean age 74.1 years (no SD available) [17]. Additional file 2: Appendix S2 shows the number of hospitalizations with stroke and sample size for each of the four hospitals included in the study.
The agreement was substantial to excellent (kappa > 0.60/AC1 > 0.80) for most variables concerning functional status before stroke onset, past medical history and drug treatment prior to stroke onset (Table 1). For previous incidence of transient ischemic attack (TIA), the kappa of 0.54 indicated moderate agreement. However, an AC1 of 0.91 and an observed agreement of 91.9 % suggested that this variable had a skewed trait distribution and hence an artificially low kappa coefficient.
The variable patient woke with stroke symptoms had a kappa coefficient of 0.28 and an AC1 of 0.49, indicating fair to moderate agreement. For the variable place of stroke onset a kappa of 0.58 was balanced by an AC1 of 0.96 and an observed agreement of 96.4 %, thus the reliability was considered to be good for this variable.
For variables concerning diagnostic imaging there was substantial to excellent agreement (kappa > 0.56/AC1 > 0.82). However, the variable heart rate monitoring had slight to moderate agreement (kappa 0.17/AC1 0.46).
The variable mobilized out of bed within 24 h after admission had no better agreement than by chance (kappa 0.04/AC1 −0.11), and the variable swallowing test performed showed only slight to fair agreement (kappa 0.19/AC1 0.27).
For the stroke subtype variable, there was close to perfect agreement (kappa 0.97/AC1 0.99).
All response options in the variable discharge destination indicated substantial agreement (kappa 0.69/AC1 0.72) (Table 3). For drug treatment at discharge, the level of agreement varied between the different types of drug, from a kappa of 0.25/AC1 of 0.65 for Dipyridamole as the least reliable variable to kappa and AC1 > 0.90 for ADP receptor antagonist, ACE inhibitor, A2 receptor blocker and statins as the most reliable variables.
There was a substantial amount of missing values in date (24.3 %) and time (58.6 %) of stroke onset in the data recorded by the nurse and in time of stroke onset in the Stroke Register (40.5 %). Time of hospital admission was missing in 13.5 % of the cases for the nurse and 2.7 % of the cases for the Stroke Register. As a consequence, only 42 and 89 cases out of 111 were included in the calculation of ICC for stroke onset time and hospital admittance time, respectively (Table 4).
Both time variables had excellent agreement (ICC 0.93-0.98) where there was a value recorded. However, the mean difference and variance between the raters were greater for stroke onset time than for hospital admission time, as was the level of incompleteness.

Discussion
Most of the variables in the Norwegian Stroke Register appeared to have substantial to excellent reliability, including many of the variables related to past medical history, functional status before the stroke, discharge destination, stroke subtype and drug treatment prior to the stroke. Variables related to focal symptoms and speech problems at admittance to hospital showed moderate agreement, while the variable describing whether the patient woke with symptoms indicated slight to moderate agreement. Furthermore, reliability was low for variables related to several examinations and clinical findings during hospitalization, such as monitoring heart rate, test of swallowing function and whether the patient was mobilized out of bed <24 h after admission.
When interpreting the results for variables with a substantial discrepancy between the kappa and AC1 coefficients, the variable was considered reliable where kappa was low and AC1 and observed agreement were high. In these cases the kappa coefficient was considered artificially low due to skewed trait prevalence. Distribution of trait prevalence for all variables is shown in Additional file 1: Appendix S1.
Time of stroke onset is a particularly important variable, as it is essential in determining eligibility for thrombolytic therapy. In this study, time of stroke onset and admission to hospital indicated excellent agreement where such time was in fact registered. However, there was a large proportion of missing values in these data. The data collection procedures in this study appear to have had some impact on the data completeness. For the nurse, using paper forms to fill in the data, there was a substantial amount of missing values in both date and time of stroke onset and in time of hospital admission. In the Stroke Register, on the other hand, time of stroke onset was the only date/time variable hampered by a high degree of missingness. A likely explanation is that the content in the Norwegian Stroke Register is collected via an online electronic form where the fields for stroke onset date and hospital admission date are mandatory fields. The time variables, however, are not mandatory.
Other studies have found a similarly high degree of missingness when recording time of stroke onset [4,18,19]. Suggestions for correcting this weakness in the data have ranged from developing a real-time data collection system for recording stroke onset time, to the use of standardized time windows instead of the actual hour/minute of the onset [20]. Although this study indicated a high degree of reliability for the date and time variables, the results suggest that steps need to be taken to ensure more complete recordings of time of stroke onset in the Norwegian Stroke Register.
The registration of stroke symptoms displayed moderate levels of agreement. One can expect to see some variation in level of agreement for these types of variables, as they are to some extent based on subjective assessments and thereby inherently difficult to classify in a consistent manner [21]. A study investigating inter-rater reliability for clinical assessment of stroke found no better agreement between clinicians assessing focal symptoms in the same patients than what we found in the present study [22]. Another problematic variable was the recording of heart rate monitoring. Almost all of the registrations in this variable fell within two categories: ECG alone or a combination of several modalities (ECG, telemetry, Holter monitoring). However, the poor level of agreement suggests that a clarification of the response categories in this variable could be useful.
The Norwegian Stroke Register's User Manual does not provide any criteria for recording whether swallowing function was tested. Given the poor reliability of this variable, there seems to be some confusion as to the definition of such a test. For the variable mobilized out of bed <24 h after admission the raw data showed that the main difference is that the nurse chose the category "unknown" to a larger extent than the stroke register. This could indicate that for this variable, it is difficult to find clear information in the medical records. In further quality enhancement work, the Stroke Register should look into the possibility of better definitions or explanations of these variables.
This study has several limitations. First, differences in data collection methods between the nurse and the Stroke Register may have had an impact on the results. This is particularly applicable for the date variables, as these are mandatory fields in the electronic web-based form of the Norwegian Stroke Register, while the nurse using paper forms did not have any validation rules affecting her registrations. Second, the nurse recorded data for four different hospitals, while the stroke register contains data collected by one registrar per hospital. Third, the sample size is not sufficient to perform analyses at the level of each hospital. Finally, 2012 was the first year after the stroke register was established as a national register and most hospitals had limited experience with documentation in the medical records and registration in the register. Hence the results might improve in the future.
Additionally, when making inferences based on reliability studies, there is an inherent challenge in determining whether discrepancies between the two raters' registrations are due to factors related to the quality of the hospital medical records, the quality of variables in the register or the quality of the registration work by the raters.

Conclusion
A majority of the variables in the Norwegian Stroke Register have substantial to excellent reliability. The problem areas seem to be the level of incompleteness in the time variable related to stroke onset and poor reliability in some variables concerning examinations, clinical findings and treatment during hospitalization. The study points to ambiguous definitions of variables or response categories, as well as difficulties in finding or interpreting information in medical records, as possible explanations for the discrepancies. Steps should be taken to improve the completeness and reliability of the variables in question.