Scoring reading parameters: An inter-rater reliability study using the MNREAD chart

Purpose: First, to evaluate inter-rater reliability when human raters estimate the reading performance of visually impaired individuals using the MNREAD acuity chart. Second, to evaluate the agreement between computer-based scoring algorithms and compare them with human rating.

Methods: Reading performance was measured for 101 individuals with low vision, using the Portuguese version of the MNREAD test. Seven raters estimated the maximum reading speed (MRS) and critical print size (CPS) from each individual MNREAD curve. MRS and CPS were also calculated automatically for each curve using two different algorithms: the original standard deviation method (SDev) and non-linear mixed-effects (NLME) modeling. Intra-class correlation coefficients (ICC) were used to estimate absolute agreement between raters and/or algorithms.

Results: Absolute agreement between raters was 'excellent' for MRS (ICC = 0.97; 95% CI [0.96, 0.98]) and 'moderate' to 'good' for CPS (ICC = 0.77; 95% CI [0.69, 0.83]). For CPS, inter-rater reliability was poorer among less experienced raters (ICC = 0.70; 95% CI [0.57, 0.80]) than among experienced ones (ICC = 0.82; 95% CI [0.76, 0.88]). Absolute agreement between the two algorithms was 'excellent' for MRS (ICC = 0.96; 95% CI [0.91, 0.98]). For CPS, the best agreement between algorithms was found when CPS was defined as the print size sustaining 80% of MRS (ICC = 0.77; 95% CI [0.68, 0.84]). Absolute agreement between raters and automated methods was 'excellent' for MRS (ICC = 0.96; 95% CI [0.88, 0.98] for SDev; ICC = 0.97; 95% CI [0.95, 0.98] for NLME). For CPS, absolute agreement between raters and SDev ranged from 'poor' to 'good' (ICC = 0.66; 95% CI [0.30, 0.80]), while agreement between raters and NLME was 'good' (ICC = 0.83; 95% CI [0.76, 0.88]).

Conclusion: For MRS, inter-rater reliability is excellent, even considering the possibility of noisy and/or incomplete data collected from low-vision individuals. For CPS, inter-rater reliability is lower, which may be problematic, for instance in the context of multisite investigations or follow-up examinations. The NLME method showed better agreement with the raters than the SDev method for both reading parameters. Setting up consensus guidelines for dealing with ambiguous curves may help improve reliability. While the exact definition of CPS should be chosen case by case depending on the clinician's or researcher's motivations, our evidence suggests that estimating CPS as the smallest print size sustaining about 80% of MRS would increase inter-rater reliability.

The primary goal of this work is to evaluate inter-rater reliability when human raters estimate reading performance using the MNREAD acuity chart. Our motivation for this study was the lack of evidence that different extraction methods used by different raters lead to comparable estimates of reading performance, which is especially relevant in the context of multicenter studies or when looking at follow-up data. Our results demonstrate excellent inter-rater reliability for the maximum reading speed (i.e. the fastest one can read when print size is not limiting) and good inter-rater reliability for the critical print size (i.e. the smallest print size that can be read at maximum speed). Our work also provides tips and instructions on how to score noisy and/or incomplete MNREAD data. These tips may serve as a starting point to help clinicians and researchers reduce variability.

Introduction

Reading difficulty is a major concern for patients referred to low-vision centers [1]. Therefore, most quality-of-life questionnaires assessing the severity of vision disability contain one or more items on subjective reading difficulty [2-5]. However, substantial discrepancy has been observed between self-reported reading difficulty and measured reading speed [6].

Among the standardized tests available, the MNREAD acuity chart can be used to evaluate reading performance for people with normal vision or low vision in clinical and research environments [7]. In brief, the MNREAD chart measures four parameters that characterize how reading performance changes as print size decreases: the maximum reading speed (MRS), the critical print size (CPS), the reading acuity (RA) and the reading accessibility index (ACC) [8]. The reading acuity and reading accessibility index are clearly defined by the number of reading errors made at small print sizes and the reading speeds for a range of larger sizes. In the original MNREAD manual, provided with the chart, MRS and CPS are defined as follows: "The critical print size is the smallest print size at which patients can read with their maximum reading speed. […] Typically, reading time remains fairly constant for large print sizes. But as the acuity limit is approached there comes a print size where reading starts to slow down. This is the critical print size. The maximum reading speed with print larger than the critical print size is the maximum reading speed (MRS)." In short, values for MRS and CPS depend on the location of the flexion point in the curve of reading speed versus print size (Fig 1). In normally sighted individuals, for whom the MNREAD curve usually exhibits a standard shape (Fig 1-A), the above definitions may be sufficient to extract MRS and CPS confidently by inspecting the curve. However, MRS and CPS can be difficult to determine, especially for readers with visual impairments, who may experience visual field defects (e.g. a ring scotoma; Fig 1-B) or use multiple fixation sites (i.e. preferred retinal loci, PRLs; Fig 1-C) [9].
In such cases, the noisy and/or incomplete dataset resulting from atypical visual function may be inconsistent with the assumption that people read at a fairly constant speed until print size compromises their ability to identify words, and MNREAD curves may take an unusual shape (Fig 1-D). If so, subjective decisions (e.g. ignoring outliers) must be made by the individual analyzing the data (referred to as the "rater" in the present work, as opposed to the "experimenter" who recorded the data). For this reason, MRS and CPS estimates may be considered highly sensitive to inter-rater variability.

Another approach to reduce variability is to fit the MNREAD curve and estimate its parameters using automated algorithms [13]. In the present work, we focus on two of these methods. The first one was described by the MNREAD creators [14,15] and is used in the MNREAD iPad app [16]. It is also the most widely used in the literature [11,17,18]. In short, it determines the CPS as the smallest print size that supports reading speeds not significantly different from the reader's maximum reading speed; we will refer to it as the standard deviation (SDev) method. The second method, especially recommended for large but incomplete datasets, estimates the critical print size from a smooth curve fit to the MNREAD data using non-linear mixed-effects (NLME) modeling [19]; we will refer to it as the NLME method. Both methods are described in the Methods section. Despite the advantage of these algorithms in operationalizing the estimation of the MNREAD parameters, they present drawbacks, for instance when looking at follow-up data involving different raters.

We investigated the reliability of CPS and MRS estimates for MNREAD data collected from participants with visual impairments. First, we evaluated the inter-rater reliability among raters (Analysis 1). Second, we evaluated the agreement between the NLME and SDev algorithms (Analysis 2). Third, we evaluated the agreement between raters and the two algorithms (Analysis 3).

Methods
Data from 101 participants with visual impairment were selected from a larger dataset, originally collected to study the prevalence and costs of visual impairment in Portugal (PCVIP study) [22,23]. Only participants whose visual acuity in the better eye was 0.5 decimal (0.3 logMAR) or worse, and/or whose visual field was less than 20 degrees, were selected for the present study.

For each individual test, a corresponding MNREAD curve was plotted using the mnreadR package [26] to display log reading speed as a function of print size (see S1 Appendix for all 101 curves). Because the shape of the curve can influence visual estimation of the reading parameters, reading speed was plotted on a logarithmic scale, so that reading speed variability (which is proportional to the overall reading speed) was constant at all speeds [14].
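As an illustration of this plotting convention, the minimal ggplot2 sketch below draws a single hypothetical curve with reading speed on a log axis; the data values and column names are fabricated for illustration only, and mnreadR also provides its own plotting facilities (see the package documentation).

    # Minimal sketch: one hypothetical MNREAD curve, reading speed on a log axis.
    # Data values and column names are fabricated for illustration only.
    library(ggplot2)

    mnread_data <- data.frame(
      print_size    = seq(1.3, -0.1, by = -0.1),  # logMAR
      reading_speed = c(198, 202, 195, 201, 190, 199, 197, 185,
                        170, 151, 120, 84, 41, 12, 3)  # words/min
    )

    ggplot(mnread_data, aes(x = print_size, y = reading_speed)) +
      geom_point() +
      geom_line() +
      scale_y_log10() +    # log scale keeps speed variability constant at all speeds
      scale_x_reverse() +  # larger print on the left, as on the chart
      labs(x = "Print size (logMAR)", y = "Reading speed (words/min)")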

Raters
Seven raters were recruited to estimate the MRS and CPS of each individual MNREAD curve. Since inter-rater reliability may be influenced by raters' prior experience with the MNREAD chart, we included raters with different levels of expertise in MNREAD parameter estimation. Each rater gave a self-rated expertise score (on a 5-point scale from 0 = 'no previous experience' to 4 = 'top expertise'), both before and after rating all the MNREAD curves, to account for the amount of practice gained during the study. Each rater was provided with S1 Appendix, containing the 101 MNREAD curves to score. Raters were instructed to follow the standard guidelines provided with the MNREAD chart instructions (see Introduction). However, because they came from patients with impaired vision, many of the curves contained noisy or incomplete data, which could make it difficult to estimate the MRS and CPS. For such cases, we provided the raters with more detailed instructions, available in S2 Appendix.

Automated estimation
MRS and CPS were also calculated automatically for each of the 101 datasets using two algorithm-based methods: the 'standard deviation' method and non-linear mixed-effects modeling.

The standard deviation (SDev) method uses the original algorithm described in [14] and [15] to estimate the MNREAD parameters. This algorithm iterates over the data searching for an optimal reading speed plateau, from which MRS and CPS are derived. To be considered optimal, a plateau must encompass a range of print sizes that supports reading speed significantly faster (by 1.96 × standard deviation) than the print sizes smaller or larger than the plateau range (Fig 2). MRS is estimated as the mean reading speed for print sizes included in the plateau, and CPS is defined as the smallest print size on the plateau. In most cases, several print-size ranges can qualify as an optimal plateau, and the algorithm chooses the one with the fastest average reading speed. In the present work, the standard deviation method estimation was performed using the curveParam_RT() function from the mnreadR R package.

The non-linear mixed effects (NLME) modeling method is particularly suited to incomplete datasets from individuals with reading or visual impairment [19]. The NLME model uses parameter estimates from a larger group (101 datasets here) to allow suitable curve fits for individual datasets that contain few data points. In the present work, we used an NLME model with a negative exponential decay function, as described in detail in [19], where a single estimate of MRS can yield several measures of CPS depending on the definition chosen (e.g. the print size required to achieve 90% of MRS, 80% of MRS, etc.). Therefore, five values of CPS were estimated, corresponding to 95%, 90%, 85%, 80% and 75% of MRS. NLME modeling and parameter estimation were performed using the nlmeModel() and nlmeParam() functions from mnreadR.
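As a sketch of how both estimations could be run with mnreadR, the snippet below assumes a hypothetical long-format data frame mnread_data (columns subject, ps, vd and rt for subject ID, print size, viewing distance and reading time); arguments are passed positionally, the CPS criterion shown (80% of MRS) is one of the five definitions used here, and exact argument names and defaults should be checked against the package documentation.

    # Minimal sketch, assuming a hypothetical long-format data frame
    # 'mnread_data' with one row per sentence read: subject ID, print size
    # written on the chart (logMAR), viewing distance (cm), reading time (s).
    library(mnreadR)

    # SDev method: plateau search described in [14,15]
    sdev_params <- curveParam_RT(mnread_data, ps, vd, rt)

    # NLME method: fit the negative exponential decay model across all
    # datasets [19], then derive MRS and CPS (here, CPS as 80% of MRS)
    nlme_fit    <- nlmeModel(mnread_data, ps, vd, rt, subject)
    nlme_params <- nlmeParam(nlme_fit, 0.8)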

Statistical analyses
In all three analyses, the intra-class correlation coefficient (ICC) was used to assess absolute agreement between raters and/or algorithms [27]. This reliability index (ranging from 0 to 1, where 1 means perfect agreement) is widely used in the literature for test-retest, intra-rater, and inter-rater reliability analyses [28]. In the present work, ICC values estimate the variation between two or more methods (whether raters or algorithms) scoring the same data by calculating the absolute agreement between them. For each analysis, the appropriate ICC form (which depends on the research design and its assumptions) was chosen by selecting the correct combination of "model", "type" and "definition", as detailed in Table 1.

Results

For CPS, ICC values were computed between SDev and NLME for each of the five CPS definitions. The results are reported in Table 3. The strongest agreement between the two automated methods was found for the 80% criterion and was good, with an ICC value of 0.77 (95% CI [0.68, 0.84]). Additionally, limits of agreement between the two algorithms were estimated using Bland-Altman plots for both MRS and CPS (Fig 4). For MRS, the average difference (i.e. bias) between the SDev method and the NLME model was 5.8 wpm (i.e. 4.5%), with 95% limits of agreement of 11.4 wpm (i.e. 10%). For CPS (defined as 80% of MRS, which showed the best agreement between methods), the bias was 0.031 logMAR, with 95% limits of agreement of 0.06 logMAR (one print-size step being 0.1 logMAR). Overall, we concluded that no significant difference could be observed between the two automated algorithms.

Table 4 shows the ICC values for each of the five CPS definitions. Overall, the NLME model showed better agreement with the raters than the SDev method for both reading parameters.

Discussion

In this project we investigated i) the agreement between raters for MNREAD parameters extracted from reading curves (Analysis 1), ii) the agreement between the SDev and NLME automated methods extracting reading parameters from raw data (Analysis 2), and iii) the agreement between raters and automated methods (Analysis 3).

Our first main result was that inter-rater reliability can be classified as excellent for MRS (ICC of 0.97) and good for CPS (ICC of 0.77). Because they are lower than 1, these agreement indices reveal the existence of discrepancies when extracting MNREAD parameters visually from reading curves. While the variability for MRS can be considered residual, the CPS estimation may be questionable. On average, the range of differences in CPS estimates was 0.19 logMAR (i.e. almost two lines on a logMAR chart), implying that the variability among raters can be considered clinically significant and potentially problematic, for example when CPS is used to prescribe optimal magnifying power. To identify the factors underlying the discrepancies observed in CPS rating, we considered whether the data itself could be involved, hypothesizing that the modest ICC value that we found (0.77) was largely due to the presence of highly noisy data. To test this hypothesis, we identified extreme outliers, i.e. curves whose CPS estimates deviated from the mean by more than three standard deviations. A total of five curves (5%) were identified as extreme outliers (#2, #31, #58, #70 and #89 in S1 Appendix). What these curves have in
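To make the statistical procedure described above concrete, the sketch below reproduces the two computations on fabricated toy data: an absolute-agreement ICC via the icc() function of the irr R package (whose "unit" argument corresponds to the single/average-measures "definition" discussed above), and Bland-Altman bias with 95% limits of agreement computed by hand. The matrix and the column choices are illustrative assumptions, not the study data.

    # Toy sketch: absolute-agreement ICC (irr package) and Bland-Altman
    # bias / 95% limits of agreement. All data below are fabricated.
    library(irr)

    set.seed(1)
    # stand-in for a 101-curve x 7-rater matrix of CPS estimates (logMAR)
    ratings <- matrix(rnorm(101 * 7, mean = 0.6, sd = 0.15), nrow = 101)

    # two-way model, absolute agreement, single-rater unit: one combination
    # of "model", "type" and "definition" (cf. Table 1)
    icc(ratings, model = "twoway", type = "agreement", unit = "single")

    # Bland-Altman between two methods (placeholder columns)
    d    <- ratings[, 1] - ratings[, 2]
    bias <- mean(d)
    loa  <- bias + c(-1.96, 1.96) * sd(d)
    c(bias = bias, lower = loa[1], upper = loa[2])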