Towards Standardization of Quantitative Retinal Vascular Parameters: Comparison of SIVA and VAMPIRE Measurements in the Lothian Birth Cohort 1936

Purpose: Semiautomated software applications derive quantitative retinal vascular parameters from fundus camera images. However, the extent of agreement between measurements from different applications is unclear. We evaluated the agreement between retinal measures from two software applications, the Singapore "I" Vessel Assessment (SIVA) and the Vessel Assessment and Measurement Platform for Images of the Retina (VAMPIRE), and examined respective associations between retinal and systemic outcomes.

Methods: Fundus camera images from 665 Lothian Birth Cohort 1936 participants were analyzed with SIVA and VAMPIRE. Intraclass correlation coefficients (ICC) and Bland-Altman plots assessed agreement between retinal parameters: measurements of vessel width, fractal dimension, and tortuosity. Retinal-systemic variable associations were assessed with Pearson's correlation, and intersoftware differences in correlation magnitude were examined with Williams's test.

Results: ICC values indicated poor to limited agreement for all retinal parameters (0.159-0.410). Bland-Altman plots revealed proportional bias in the majority of measurements, and systematic bias in all of them. SIVA and VAMPIRE measurements were associated most consistently with systemic variables relating to blood pressure (SIVA r's from -0.122 to -0.183; VAMPIRE r's from -0.078 to -0.177). Williams's tests indicated significant differences in the magnitude of association between retinal and systemic variables for 7 of 77 comparisons (P < 0.05).

Conclusions: Agreement between two common software applications was poor. Further studies are required to determine whether associations with systemic variables are software-dependent.

Translational Relevance: Standardization of the measurement of retinal vascular parameters is warranted to ensure that they are reliable and application-independent. This would be an important step towards realizing the potential of the retina as a source of clinically useful, imaging-derived biomarkers.


Introduction
Retinal microvascular features and their changes identified from the analysis of fundus camera images have been associated with cardiovascular disease, hypertension, stroke, dementia, and cognitive impairment. [1][2][3][4][5] The development of semiautomated software applications, such as Singapore "I" Vessel Assessment (SIVA; National University of Singapore, Singapore), Vessel Assessment and Measurement Platform for Images of the Retina (VAMPIRE), [6][7][8] Interactive Vessel Analysis (IVAN; University of Wisconsin, Madison, WI), Quantitative Analysis of Retinal Vessel Topology (QUARTZ), 9 Retinal Analysis (RA; Department of Ophthalmology & Visual Science, University of Wisconsin), and Automated Retinal Image Analyser (ARIA), 10 has enabled the measurement of quantitative retinal parameters with increasing efficiency. 11 While these measurements offer great potential to examine the role of microvascular pathology in the pathophysiology of cerebral and cardiovascular diseases, 12 the applicability of retinal measurements in the clinical setting is yet to be fully established, partly due to methodologic limitations.
In the absence of ground truth measurement, the accuracy (i.e., the degree to which an instrument measures the true value of a variable) of such software applications cannot be determined. Method-comparison studies, however, assess how measurements from different software applications agree or differ. Agreement can be examined in two ways: first, by comparison of the absolute raw values produced by each software application (absolute agreement); and second, by comparison of the individual differences (i.e., differences between individual scores when ranked along a continuum) measured by each application (individual differences agreement). While some studies have identified several potential sources of variation, including physiologic characteristics of the eye, angle of imaging, image quality, cameras, thresholding and segmentation methods, and intraobserver variability, [13][14][15] few studies have compared measurements from different software applications, which is the focus of this study.

Current measurements of retinal vascular parameters might vary depending on the software from which they are derived. For instance, the absolute or "raw" value of the central retinal artery equivalent (CRAE; a summary of arteriolar vessel width) measured using software X might not be consistent with the absolute value of CRAE from the same image measured using software Y. A fundamental issue is whether differences in absolute (raw) measurements between the applications translate into meaningful differences in the detection of associations with systemic variables. If agreement across software in the measurement of individual differences is good (i.e., software applications X and Y rank individuals in a similar position along the measurement scale for a given retinal parameter), differences in absolute values might be less important.
Systematic bias may not affect associations with systemic variables as long as the linear relationship between measurements from each method is the same. However, should retinal vascular measurements from fundus image analysis be applied as optical coherence tomography measurements of the retinal nerve fiber layer are in the diagnosis of glaucoma and other ocular pathologies, 16 or to determine inclusion in a clinical trial, a systematic bias would be very problematic.
Only moderate agreement of individual differences has been found previously between measurements of retinal vessel widths using IVAN and SIVA, meaning the error between the measurement and the true value was not constant for each software application. For instance, Yip et al. (IOVS 2012;53:ARVO E-Abstract 4113) reported an intraclass correlation coefficient (ICC) of 0.516 (95% confidence interval [CI], 0.41-0.61) for CRAE, and an ICC of 0.509 (95% CI, 0.40-0.60) for central retinal vein equivalent (CRVE). Significant intersoftware differences in CRAE and CRVE measurements (P < 0.001) from IVAN and SIVA also have been reported by Hao et al. 17 SIVA returned systematically larger measurements of CRAE and CRVE compared to IVAN according to Wei et al. 18 These findings indicate that agreement among software is not strong in terms of absolute measurements, and the low-to-moderate ICC values reported by Yip et al. (IOVS 2012;53:ARVO E-Abstract 4113) suggest that precision, or agreement in the measurement of individual differences, also may be poor.
Few studies have explicitly compared the strength of associations between retinal parameters and systemic variables across different applications. Yip et al. 19 measured CRAE and CRVE using three software applications (RA, IVAN, and SIVA), and assessed the degree of agreement between their associations with systemic factors, including blood pressure, cholesterol levels, and body mass index. Though there were large differences in the absolute values of CRAE and CRVE among the software (CRAE, mean difference = -6.7 to -21.8 μm; CRVE, -7.7 to -18.2 μm), Pearson's correlation coefficients (assessing individual differences agreement for the retinal measures from each software) were high (r's from 0.762 to 0.895) and there were no significant differences in the strength of their associations with systemic variables. Therefore, it remains unclear whether one software has greater predictive utility than another.
We used retinal imaging data from the Lothian Birth Cohort 1936 (LBC1936) to investigate the variation between two widely used software applications: SIVA and VAMPIRE. To the best of our knowledge, there has been no comprehensive assessment of the association between retinal parameters measured using different software applications beyond individual studies of summary vessel width measures and fractal dimension. We determined agreement in absolute (raw value) and individual (precision) differences between measurements of widths, tortuosity, and fractal dimension from SIVA and VAMPIRE. Furthermore, we determined the strength of association of measurements taken from both software applications with well-established systemic variables (blood pressure, inflammatory markers, and large-artery atheroma), and examined potential differences in the magnitude of these associations. We also discuss areas affecting the reliability of semiautomated retinal measurements with a view towards standardization within the field.

Methods
Ethical permission for the LBC1936 study was obtained from the Lothian Research Ethics Committee (Wave 1, LREC/2003/2/29), the Multi-Centre Research Ethics Committee for Scotland (Wave 1, MREC/01/0/56), and the Scotland A Research Ethics Committee (Wave 2, 07/MRE00/58). Written informed consent for participation in the study was obtained from all participants. The research was carried out in compliance with the Helsinki Declaration.

Participants
Data were drawn from a subsample of the LBC1936 study. The LBC1936 comprises 1091 community-dwelling, healthy older adults, mostly free of diseases affecting the vasculature, and with a very narrow age range. Most participants completed the Moray House Test No. 12 20 of verbal reasoning at a mean age of 11 as part of the Scottish Mental Survey of 1947 (SMS1947). 21,22 Between 2004 and 2007, those residing in Edinburgh and the Lothians who may have taken part in the SMS1947, and who were then approximately age 70, were contacted and invited to participate in the LBC1936 study. Recruitment and testing of this cohort have been described in detail previously. 23,24 Data for the present study were obtained between 2008 and 2010, when the participants were approximately 73 years old (N = 866). Analyses were based on a subsample with retinal images suitable for analysis (n = 665).

Retinal Image Analysis
Images of the right and left retinas were captured using a nonmydriatic camera with a 45° field of view (CRDGi; Canon USA, Inc., Lake Success, NY). Retinal parameters from one eye were measured using SIVA (version 3.0). If both images were of the same quality, the right eye was chosen (n = 343); if the right-eye image was unavailable or ungradable, the image of the left eye was used (n = 322). Quality assessment was performed visually by a trained software operator following a standard protocol. The main reasons for image rejection included images being centered overly towards the macula (too few vessels visible); images with known pathologies, including asteroid hyalosis and cataract; and cases of very poor image quality, including out-of-focus images, eyelashes causing streaks across the photograph, small pupil size leading to dark or grainy images, and overexposure. SIVA measurement and summarization methods have been described fully previously. 25,26 Retinal parameters from the same images analyzed using SIVA were measured using VAMPIRE (version 3.1). VAMPIRE measurement and summarization procedures have been described in detail previously. 6,8,[27][28][29][30] A single trained operator was responsible for the visual assessment of automated measurements with each software application (MK, SIVA; SM, VAMPIRE), performing manual intervention where necessary, according to software-specific standardized measurement protocols. See Supplementary Table S1 for details of the main operator interactions. Supplementary Figure S1 presents the user interface of both software applications.

Measures
Retinal Parameters.
CRAE, CRVE, arteriole-to-venule ratio (AVR), tortuosity (TORTa, TORTv), and fractal dimension (FDa, FDv) of the vasculature were calculated using SIVA and VAMPIRE. Separate arteriolar and venular measures are indicated by lowercase "a" or "v." Measurement zones within which to measure retinal parameters were set (by SIVA and VAMPIRE) in relation to the center of the optic disc (OD) and its size. Vessel width measurements were derived from within Zone B (an annulus 0.5-1.0 disc diameters from the optic disc margin), and tortuosity and fractal dimension were measured from vessels within Zone C (an annulus 0.5-2.0 disc diameters from the disc margin). The labels of each retinal variable and the zone within which they are measured were the same for both software applications. The applications derived the same outcomes, but the underlying measurement algorithms differed. A description of all retinal parameters and zones is provided in Supplementary Table S2.

SIVA and VAMPIRE measured CRAE and CRVE in pixels from the images. The same method was applied for both software applications to convert pixel measurements to absolute measurements in microns. This was based on the assumption of an average OD diameter of 1800 μm in an adult human, adopted commonly in the literature. 31 With VAMPIRE, the mean OD diameter (in pixels) of the entire sample was used to derive an image conversion factor (ICF), whereas the procedure for SIVA was to calculate the ICF by measuring the OD diameter in a subsample of images (10%). We could have changed the conversion method of one software to match the other, thus eliminating one source of variability between software systems; however, to assess agreement without manipulating the data in ways that might artificially increase it, the conversion to micrometers was performed for VAMPIRE and SIVA according to their respective measurement protocols.

Systemic Variables.
Systemic variables were assessed concurrently with obtaining the retinal images. We selected factors that previously have been associated with retinal measures, including blood pressure, cardiovascular disease, diabetes, and inflammation, 32 and those that have been used previously to examine and comprehensively adjust for vascular risk in the LBC1936. 33 These factors were: hypertension (self-reported history), mean systolic and mean diastolic blood pressure (mean of three sitting BP readings, mm Hg), ankle-brachial pressure index, carotid intima-media thickness, hemoglobin A1c, plasma high-density lipoprotein cholesterol, C-reactive protein, von Willebrand factor, and interleukin-6 (IL-6).
All variables, with the exception of self-reported hypertension, were treated as continuous variables.

Statistical Analysis
Analyses were conducted using SPSS V.21 (IBM, New York, NY) and R (version 1.0.136). All variables were examined for normality before analysis. Outlying values (±3 standard deviations [SD]) were winsorized to minimize the influence of extreme outliers without losing relevant data. Tortuosity, C-reactive protein, IL-6, and carotid measures were log-transformed to improve their distributions, which were positively skewed. Two-way mixed-model ICCs were used to evaluate the extent of correspondence between the two methods (SIVA and VAMPIRE) for measuring the same parameter (e.g., CRAE). The ICC quantifies this agreement, combining a measure of correlation with a test of the difference in means, correcting for systematic bias and agreement based on chance alone. ICCs are considered more appropriate than Pearson's r for assessing whether two methods for measuring a quantitative parameter provide similar results, as Pearson's r measures only the extent to which two variables are linearly dependent. 34 Method-comparison studies have demonstrated that a perfect linear relationship does not necessarily reflect good or even moderate agreement as measured by ICC. 34,35 In cases where a systematic bias is present in the data, Pearson's correlation may indicate high correlation despite poor agreement between values, as the linear relationship between measurements would be unaffected. 35 ICC results were interpreted using the following criteria: 0.00-0.49 = poor, 0.50-0.74 = moderate, and 0.75-1.00 = excellent. 36 Single-measure coefficients and 95% CIs are reported.
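As an illustrative sketch of these two steps (in Python rather than the SPSS/R used in the study), the ±3 SD winsorization and a two-way, single-measure, absolute-agreement ICC can be computed from the classic ANOVA decomposition; the formula below is the standard ICC(A,1), which may differ in detail from the exact package options the authors chose.

```python
import numpy as np

def winsorize_3sd(v):
    """Clip values lying beyond +/-3 SD of the mean, as described above."""
    v = np.asarray(v, float)
    lo, hi = v.mean() - 3 * v.std(), v.mean() + 3 * v.std()
    return np.clip(v, lo, hi)

def icc_absolute_agreement(x, y):
    """Single-measure ICC for absolute agreement between two raters
    (two-way model), computed from the ANOVA mean squares."""
    data = np.column_stack([x, y]).astype(float)
    n, k = data.shape                                   # subjects, raters
    grand = data.mean()
    row_means = data.mean(axis=1)
    col_means = data.mean(axis=0)
    bms = k * ((row_means - grand) ** 2).sum() / (n - 1)   # between subjects
    jms = n * ((col_means - grand) ** 2).sum() / (k - 1)   # between raters
    resid = data - row_means[:, None] - col_means[None, :] + grand
    ems = (resid ** 2).sum() / ((n - 1) * (k - 1))         # residual
    return (bms - ems) / (bms + (k - 1) * ems + k * (jms - ems) / n)
```

Note that a constant offset between the two raters (a systematic bias) lowers this ICC even when the two sets of scores are perfectly linearly related, which is why the ICC rather than Pearson's r is used here.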
We used Bland-Altman plots 37,38 to provide a visual representation of how differences between the measurements relate to the mean across the full range of values. The difference between the two measurements (VAMPIRE - SIVA) was plotted against the average of the two measurements with the 95% limits of agreement (LOA), defined as the mean difference ± 1.96 × SD. Narrower 95% LOA indicate higher agreement. The presence of systematic, or fixed, bias was assessed using a one-sample t-test comparing the mean difference against zero. Proportional bias also was tested by determining whether the slope of the regression line significantly differed from zero.
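The quantities behind these plots can be sketched as follows (illustrative Python, not the authors' code; the fixed-bias test is reported as a bare t statistic to keep the sketch dependency-free).

```python
import numpy as np

def bland_altman_stats(vampire, siva):
    """Bland-Altman summary for paired measurements; differences are taken
    as VAMPIRE - SIVA, as in the paper."""
    vampire = np.asarray(vampire, float)
    siva = np.asarray(siva, float)
    diff = vampire - siva
    avg = (vampire + siva) / 2
    n = diff.size
    mean_diff = diff.mean()
    sd_diff = diff.std(ddof=1)
    # 95% limits of agreement: mean difference +/- 1.96 x SD
    loa = (mean_diff - 1.96 * sd_diff, mean_diff + 1.96 * sd_diff)
    # Fixed (systematic) bias: one-sample t statistic of mean difference vs zero
    t_fixed = mean_diff / (sd_diff / np.sqrt(n))
    # Proportional bias: slope of the difference regressed on the average
    slope, _ = np.polyfit(avg, diff, 1)
    return {"mean_diff": mean_diff, "loa": loa, "t_fixed": t_fixed, "slope": slope}
```

A nonzero mean difference with a near-zero slope indicates fixed bias only; a slope significantly different from zero indicates that the disagreement grows or shrinks across the measurement range.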
Finally, we used bivariate correlations to assess the associations between retinal measurements and systemic variables. Where both variables were continuous, we used Pearson's correlation, and biserial correlation was used when one variable was dichotomous. To minimize the potential for type 1 errors, P values were adjusted according to the false discovery rate using the p.adjust() function in R, with the method of Benjamini and Hochberg. 39 Briefly, the false discovery rate method considers the ensemble of hypothesis tests and sets the P value threshold such that 5% of the tests declared significant are expected to be false-positives. We used Williams's test, 40 implemented in R (paired.r command), to examine potential differences between SIVA and VAMPIRE in the magnitude of the correlations between each systemic variable and retinal measurements. Williams's test assesses whether the correlations between a predictor variable (e.g., blood pressure) and two competing criterion variables (VAMPIRE and SIVA measurements), which themselves are correlated, differ significantly in magnitude.

Results
Table 1 describes the characteristics of the study sample. Participants with at least one retinal measurement suitable for analysis (i.e., the vasculature could clearly be seen in the image and was segmented by both software applications, enabling completion of the measurement process; n = 665; 328 female, 337 male) had a mean age of 72.5 years (SD = 0.70) at the time retinal photographs were taken. Mean absolute values of retinal measurements from VAMPIRE and SIVA are reported in Table 2. Scatterplots of the relationship between retinal measurements using VAMPIRE and SIVA are presented for all variables (Figs. 1-3). (Table 1 abbreviations: CVD, cardiovascular disease; HbA1c, glycated hemoglobin (mmol/mol); CRP, C-reactive protein (mg/L); IL-6, interleukin-6 (pg/mL); HDL, high-density lipoprotein (mmol/L); vWF, von Willebrand factor; ABPI, ankle-brachial pressure index (systolic blood pressure [BP] in ankle/systolic BP in arm by Doppler ultrasound and random zero sphygmomanometer); IMT, intima-media thickness (mean over a 1-cm segment of the common carotid artery).)

ICCs for correspondence of retinal parameter measurements between SIVA and VAMPIRE are shown in Table 2. ICCs indicated that agreement between all measures was poor (0.159-0.410). Figures 4 to 6 show Bland-Altman plots illustrating the agreement between VAMPIRE and SIVA for measurement of seven retinal vascular parameters. As there is no "reference" standard, all differences were calculated such that SIVA measurements were subtracted from VAMPIRE measurements. Average differences between VAMPIRE and SIVA, and systematic and proportional bias, are described in Table 3.
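The false discovery rate adjustment applied to the P values in these analyses follows the Benjamini-Hochberg step-up procedure; a minimal Python sketch mirroring R's p.adjust(method = "BH"):

```python
def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted P values (step-up procedure)."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])   # indices, ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value down, carrying the cumulative minimum
    for pos, idx in enumerate(reversed(order)):
        rank = m - pos                                 # 1-based rank of pvals[idx]
        running_min = min(running_min, pvals[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted
```

An adjusted value below 0.05 corresponds to a test surviving the 5% false discovery rate criterion described in Methods.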
Tables 4 and 5 show the associations between systemic measures and retinal parameters measured by SIVA and VAMPIRE. While most significant associations between retinal parameters and systemic variables were found for only one or the other software package, there were six instances of consistent significant associations for measurements from SIVA and VAMPIRE. There were significant negative associations between retinal vascular measures from both software packages and mean systolic and diastolic BP: diastolic BP (CRAE, r's from -0.183 to -0.118; AVR, r's -0.162 to -0.148; FDa, r's -0.169 to -0.107) and systolic BP (CRAE, r's -0.177 to -0.112; FDa, r's -0.127 to -0.118). These associations remained significant after correcting for false discovery rate. There was one significant positive association between IL-6 and TORTv for SIVA (r = 0.098) and VAMPIRE (r = 0.096); these associations did not survive correction for false discovery rate. Supplementary Figure S2, as an example of one such instance of significant and consistent associations across software packages, shows a visual representation of the association.

Williams's tests indicated significant differences between SIVA and VAMPIRE in the magnitude of association with systemic variables for seven of 77 comparisons. Four demonstrated a significantly stronger association with SIVA measurements than VAMPIRE: CRVE and IL-6 (r = 0.114 vs. r = -0.003, P = 0.03), AVR and C-reactive protein (r = -0.154 vs. r = -0.066, P = 0.05), AVR and IL-6 (r = -0.122 vs. r = -0.033, P = 0.04), and TORTa and ankle-brachial pressure index (r = -0.045 vs. r = -0.026, P = 0.04). The three associations where VAMPIRE demonstrated significantly stronger associations with systemic variables than SIVA were CRAE and right carotid intima-media thickness (r = -0.114 vs. r = -0.013, P = 0.03), FDa and left carotid intima-media thickness (r = -0.079 vs. r = 0.043, P = 0.01), and FDa and ankle-brachial pressure index (r = 0.054 vs. r = -0.045, P = 0.03).
It should be noted that in two of these cases significant differences in magnitude were based on comparison of two small and nonsignificant correlations lying on either side of zero (FDa and ankle-brachial pressure index, r = -0.045 vs. r = 0.054; TORTa and ankle-brachial pressure index, r = -0.045 vs. r = 0.026).
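For reference, the statistic behind these magnitude comparisons can be sketched as follows (illustrative Python following the standard Williams T2 formula for two dependent, overlapping correlations; whether this matches R's paired.r implementation in every detail is an assumption).

```python
import math

def williams_t(r_xy, r_xz, r_yz, n):
    """Williams's test for the difference between two dependent, overlapping
    correlations r(x, y) and r(x, z), given r(y, z) and sample size n.
    Returns the t statistic on n - 3 degrees of freedom."""
    # Determinant of the 3x3 correlation matrix
    det = 1 - r_xy**2 - r_xz**2 - r_yz**2 + 2 * r_xy * r_xz * r_yz
    av = (r_xy + r_xz) / 2
    num = (r_xy - r_xz) * math.sqrt((n - 1) * (1 + r_yz))
    den = math.sqrt(2 * ((n - 1) / (n - 3)) * det + av**2 * (1 - r_yz)**3)
    return num / den
```

Here x would be a systemic variable (e.g., diastolic BP) and y and z the SIVA and VAMPIRE versions of the same retinal parameter, whose mutual correlation r(y, z) enters the denominator.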
Overall, the association between SIVA and VAMPIRE retinal vessel width, fractal dimension, and tortuosity measures was poor, and assessment of agreement between measures using Bland-Altman plots indicated the presence of systematic and proportional bias for the majority of parameters. Of the 77 retinal-systemic associations examined, six were significant for both SIVA and VAMPIRE measurements. Five of these remained significant following correction for false discovery rate. Significant differences between software in the strength of correlation with systemic variables were found for seven of 77 comparisons.

Figure 4. Vessel width. Bland-Altman plot of differences between CRAE, CRVE, and AVR measurements from VAMPIRE and SIVA, plotted against the average of the two methods. Broken red lines represent the mean difference ± 2 SD of the difference (95% limits of agreement). Broken grey lines represent 95% CIs on the bias and the upper and lower limits of agreement. CRAE and CRVE are measured in micrometers. AVR, as a dimensionless measure, is not measured in standard units.

Figure 5. Tortuosity. Bland-Altman plot of differences between log TORTa and log TORTv measurements from VAMPIRE and SIVA, plotted against the average of the two methods. Broken red lines represent the mean difference ± 2 SD of the difference (95% limits of agreement). Broken grey lines represent 95% CIs on the bias and the upper and lower limits of agreement. Tortuosity, as a dimensionless measure, is not measured in standard units.

Figure 6. Fractal dimension. Bland-Altman plot of differences between FDa and FDv measurements from VAMPIRE and SIVA, plotted against the average of the two methods. Broken red lines represent the mean difference ± 2 SD of the difference (95% limits of agreement). Broken grey lines represent 95% CIs on the bias and the upper and lower limits of agreement. Fractal dimension, as a dimensionless measure, is not measured in standard units.

Discussion
Examination of Bland-Altman plots indicated that systematic bias was present for all retinal measurements; depending on the specific parameter, one software reported consistently higher or lower measurements than the other. Proportional bias, indicating variability in differences across the measurement range, also significantly contributed to a lack of agreement for most retinal parameters. Therefore, our findings did not support equivalence of retinal vascular measurements between software applications, indicating that absolute measurements were not interchangeable. An important consequence is that values from different systems would require standardization for data pooling or meta-analysis. Individual differences agreement as measured by ICC also was poor, with all coefficients considerably lower than the proposed 0.90 minimum level of agreement for measures to be used interchangeably in clinical practice. 41 This leaves uncertainty over whether associations with systemic variables also are software dependent. Williams's tests showed significant differences in correlation magnitude for seven of 77 comparisons; however, conclusions regarding the equivalence of SIVA and VAMPIRE associations with systemic variables should be drawn cautiously given the low number of significant retinal-systemic associations from either software, after false discovery rate adjustment, found in the current sample. It is important to note, however, that the weak associations reported here do not imply poor retinal parameters. Our relatively homogeneous sample may have restricted the range of values required to identify stronger correlations. Associations might be stronger in clinically relevant samples. Comparability of retinal-systemic variable associations in broader populations, including those with disease, should be examined. Nevertheless, our results remain of value as an examination of agreement in a well-characterized sample of healthy older adults.
Current results correspond with, though ICCs still are lower than, those of Yip et al. (IOVS 2012;53:ARVO E-Abstract 4113), who reported moderate agreement between the width measurements obtained from SIVA and IVAN in a middle-aged population. Associations between SIVA and VAMPIRE and systemic parameters were weaker than those found by Yip et al. 19 with SIVA, IVAN, and RA. Sample age differences may account for the weaker results in our study, as stronger cross-sectional associations have been reported in middle-aged compared to older samples. Furthermore, there is no ultimate ground truth upon which to assess measurements from fundus camera image analysis. "Errors" cannot be defined as in metrology or physics, as our references themselves are affected by uncertainty. Such uncertainties are likely to contribute to the lack of agreement in the current results. Lack of agreement can arguably be traced to two classes of factors. First, error variance may be introduced by factors that affect both software applications (see Supplementary Discussion). Second, there may be factors specific to either SIVA or VAMPIRE, such that one has good reliability while the other does not. In the following section we focus on the latter, and the impact these issues may have had on the results reported. Each factor provides an opportunity for progressing towards greater standardization of retinal measurements from fundus camera images.
The low agreement of some measurements may result from different solutions to obtaining retinal parameters (i.e., the combination of image postprocessing techniques and numerical algorithms) implemented in SIVA and VAMPIRE. This raises the possibility that some measurements from one application may have better reliability than those from the other, which again would result in poor intersoftware agreement. The extent to which such differences are relevant for clinical investigations must be determined. The following sections briefly examine the eight main procedures involved.
Vasculature Detection. Retinal vessel measurements depend crucially on the detection of the retinal vasculature. Simply put, this involves the generation of a map showing which pixels are classified as either vessel or background. However, the maps generated by different algorithms for the same images will vary in terms of which vessels are detected (especially thin ones), and of their widths and centerlines, for example. VAMPIRE uses a modified 2-D Gabor wavelet supervised classification algorithm for automatic vessel detection 43 and locates centerlines through skeletonization of the binary vessel map. 44 SIVA incorporates Daubechies wavelet, trench, and curvature-based segmentation to identify and extract retinal vascular structure, with a modified trench detection algorithm applied in vessel segmentation to locate the centerlines of the vessels. 26 Unfortunately, as segmentation data were not available from both software packages, we could not assess the effect of these differences explicitly. Finally, threshold levels for the identification of very small vessels may differ. 45

Determination of OD Diameter. Differences in the determination of the OD diameter, through automatic software detection or inter/intraoperator differences, also may influence measurements, as the regions in which vessels are measured are defined by OD diameter. For example, in the case of AVR, the average vessel width will change according to the size of the radius of Zone B, and the inclusion and exclusion of vessels in the AVR calculation also will differ according to the region set by the OD diameter. OD measurements were not available from both software tools, precluding the assessment of agreement and of the effect of different OD diameter determination.
Vessel Width. Accurate quantification of vessel widths using semiautomated computer software has proven difficult. 46 The lack of an absolute, objective definition of the retinal vessel boundary 7,8 and of ground truth in vivo increases the uncertainty of measurements, normally validated against experts' annotations obtained from interactive packages. 46,47 Computer-assisted quantification of vessel width from fundus camera imaging measures the width of the reflective erythrocyte column. As the surrounding clear plasma zone is not measured, true vessel diameter is underestimated. 48 The vessel edge is not clear due to loss of intensity with reducing column depth at the boundary, which makes it difficult to determine whether individual pixels at the vessel edge belong to a vessel. While this is not likely to cause a large degree of variation, differences found in average single vessel width from each quadrant across 20 images measured using ARIA and IVAN (5.56-7.94 and 7.44-19.73 μm for arteriolar and venular width, respectively) were attributed to the use of different methods of defining the vessel edge by the two systems (Silvestri V, et al. IOVS 2012;53:ARVO E-Abstract 2178).
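Individual widths are then summarized into CRAE and CRVE by iteratively pairing vessels with branching coefficients of the kind proposed by Knudtson et al. (0.88 for arterioles, 0.95 for venules). A generic sketch of that pairing follows; the exact vessel selection rules used by SIVA and VAMPIRE may differ, so this is an illustration rather than either package's implementation.

```python
import math

def knudtson_summary(widths, k):
    """Iteratively pair the largest with the smallest width using branching
    coefficient k (0.88 for arterioles/CRAE, 0.95 for venules/CRVE) until a
    single summary value remains. Conventionally applied to the six largest
    vessels in the measurement zone."""
    ws = sorted(widths)
    while len(ws) > 1:
        small, large = ws.pop(0), ws.pop(-1)
        ws.append(k * math.sqrt(small**2 + large**2))
        ws.sort()
    return ws[0]

# AVR is then simply CRAE / CRVE.
```

Because the summary depends on which vessels are detected and paired, two packages can disagree on CRAE and CRVE even when their individual width measurements agree.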
Arteriole-to-Venule Ratio. Although one of the most standardized retinal measurements, AVR from different applications will differ as it is based on CRAE and CRVE, which vary depending on factors such as the number of vessels measured. In our study, differences between AVR from VAMPIRE and SIVA ranged from 0.16.

Tortuosity. Differences in the vessel segments that SIVA and VAMPIRE sample for tortuosity, and in how these are combined into a single measure per image (i.e., the use of mean, weighted mean, or median to derive a single tortuosity value), are likely. Poor agreement of tortuosity values also could be due to validation differences (testing against clinical judgment after conversion to a 3- or 4-point scale, or testing against different ground truth sets, annotated in turn by different experts), and the use of different computational definitions of tortuosity measured within different numerical ranges. 29

Fractal Dimension. FD measurements are sensitive to variations in segmentation, image acquisition angle, and FD calculation algorithms. 15 Differences between software in vessel detection and in the fractal computation in our study will have contributed to the different values returned. SIVA performs monofractal analysis using the box-counting method, 45 whereas VAMPIRE performs monofractal analysis using box counting and multifractal analysis using the generalized sand-box method. 49 It should be noted, however, that the multifractal VAMPIRE FD used in the current analysis had better agreement with FD from SIVA (ICCs from 0.373 to 0.410) than a monofractal FD calculated by VAMPIRE (ICCs from 0.192 to 0.195).
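The box-counting estimate of monofractal dimension mentioned above can be sketched in a few lines (a minimal illustration on a square binary vessel map; neither package's exact implementation).

```python
import numpy as np

def box_counting_fd(binary_img, box_sizes=(2, 4, 8, 16, 32)):
    """Monofractal dimension of a square binary vessel map by box counting:
    the slope of log(number of boxes containing vessel pixels) against
    log(1 / box size)."""
    n = binary_img.shape[0]
    counts = []
    for s in box_sizes:
        count = 0
        for i in range(0, n, s):
            for j in range(0, n, s):
                if binary_img[i:i + s, j:j + s].any():
                    count += 1
        counts.append(count)
    slope, _ = np.polyfit(np.log(1.0 / np.array(box_sizes)), np.log(counts), 1)
    return slope
```

A filled plane yields a dimension of 2 and a straight line yields 1; retinal vasculature falls between, and the estimate shifts with segmentation quality and the chosen box sizes, which is one reason different packages return different FD values.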
Algorithm-Based Measurement. The current practice of estimating vessel width relies upon derived parameters (CRAE and CRVE) that are based on an algorithm, which is itself an estimation. As vessel width is not derived from direct measurement of a physiologic or anatomic feature, the reliability of the measurement thus depends on the reliability of the algorithm. Algorithms also are used in the measurement of other parameters that have proven difficult to quantify directly, such as tortuosity, for which numerous algorithm variations exist that attempt to capture the curvature of vessels.
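As an illustration of such an algorithmic estimate, CRAE and CRVE are commonly computed with the revised Knudtson pairing formulas, in which the widths of the six largest vessels are combined pairwise (widest with narrowest) using branching constants of 0.88 for arterioles and 0.95 for venules. The sketch below assumes that convention; the exact pairing and vessel-selection rules inside SIVA and VAMPIRE may differ, and the example widths are hypothetical.

```python
import math

def summarize_widths(widths, k):
    """Iteratively pair the widest with the narrowest vessel and combine
    each pair as w = k * sqrt(w1**2 + w2**2) until one value remains
    (revised Knudtson convention: k = 0.88 for arterioles -> CRAE,
    k = 0.95 for venules -> CRVE)."""
    widths = sorted(widths)
    while len(widths) > 1:
        carried = []
        if len(widths) % 2 == 1:
            # Carry the median vessel forward unpaired this round.
            carried = [widths.pop(len(widths) // 2)]
        paired = [k * math.hypot(widths[i], widths[-(i + 1)])
                  for i in range(len(widths) // 2)]
        widths = sorted(paired + carried)
    return widths[0]

# Hypothetical widths (pixels) of the six largest arterioles:
crae = summarize_widths([14.2, 15.8, 16.4, 17.1, 18.9, 20.3], k=0.88)
```

Any deviation between packages in which six vessels are selected, or in how odd numbers of vessels are carried between pairing rounds, propagates directly into CRAE, CRVE, and hence AVR.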
Pixel to Micron Conversion. For the purposes of comparison across imaging systems and software packages, pixel measurements of CRAE and CRVE frequently are converted to standard units of length (i.e., micrometers). The same approach was used in this study for SIVA and VAMPIRE (1800 µm/average OD diameter), as recommended by the SIVA protocol. 26 However, assuming the make and model of the camera system and the angle of acquisition are consistent, the conversion factor depends on the average OD diameter measurement in the specific patient sample. Considerable variation exists in OD size within and between populations. 50 The use of a sample-specific conversion factor based on the sample average OD diameter (and on an assumed 1800 µm average OD diameter) may have important implications for retinal measurements; this becomes evident when retinal measurements in micrometers based on the standard image conversion factor (ICF = 1800 µm/average OD diameter) are compared to those calculated using a conversion factor unique to an individual (1800 µm/individual OD diameter). We ran this test with a set of 10 LBC1936 images (two images selected at each quintile point of the OD diameter range), converting VAMPIRE CRAE and CRVE measurements from pixels to micrometers using the standard and individual conversion factors (see Supplementary Table S3). Individual OD diameters varied from the sample mean by up to 367 pixels (the average diameter being 428.92 pixels, SD = 48.14); this resulted in conversion factor differences of up to 3.6 (sample data ICF range = 6.; see Supplementary Table S4). Crucially, the variation introduced by using a standardized conversion factor (up to 76 µm) is larger than differences reported between patient groups (e.g., CRVE of 218 µm in lacunar stroke vs. 208 µm in cortical stroke), 5 and consequently may mask true differences between individuals.
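The arithmetic behind this comparison can be made explicit with a short sketch. Apart from the 1800-µm assumption and the 428.92-pixel sample mean quoted above, all numeric values below are hypothetical, chosen only to show how the two conversion routes diverge for a single eye.

```python
def conversion_factor(od_diameter_px, assumed_od_um=1800.0):
    """Micrometers per pixel, assuming the optic disc is assumed_od_um
    wide: ICF = 1800 um / OD diameter in pixels."""
    return assumed_od_um / od_diameter_px

# Standard route: one factor for the whole sample, derived from the
# sample mean OD diameter (428.92 px in the LBC1936 sample).
sample_icf = conversion_factor(428.92)

# Individual route: a hypothetical eye whose OD measures 380 px.
individual_icf = conversion_factor(380.0)

# The same hypothetical CRVE of 52 px converts to noticeably different
# micrometer values under the two factors.
crve_standard_um = 52 * sample_icf
crve_individual_um = 52 * individual_icf
difference_um = crve_individual_um - crve_standard_um
```

In this example the single eye's CRVE shifts by roughly 28 µm depending only on the conversion route, which is of the same order as the between-group differences cited above.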
Using pixel measurements in statistical analyses, other factors being equal, removes the uncertainty introduced by pixels-micrometers conversion factors. Measurements in micrometers always can be obtained for the ophthalmologist's benefit, but must be interpreted carefully within the limits of the approximations or standardizations implied by the conversion method.
Some factors introducing uncertainties in retinal imaging are common to other areas of medical imaging. [51][52][53] For example, speckle tracking echocardiography measurements were dependent on the algorithm applied by the specific software system used. 51,53 Different algorithms also were a contributing factor to the poor agreement between three widely used software applications for myocardial perfusion imaging. 52 A model eye allowing calibration with measurements of a known size could help to improve accuracy. 54 Phantoms of a known size have been used in computed tomography, magnetic resonance imaging, and positron emission tomography imaging, allowing accuracy and variance to be analyzed. 55,56 Phantoms of the eye, including the retinal vasculature with simulated blood flow and known dimensions, could enable the calibration of measurements.

TVST | 2018 | Vol. 7 | No. 2 | Article 12 | McGrory et al.

Our study has a number of strengths. We expanded upon the current literature in two ways, first by assessing agreement between measurements from two software applications on a comprehensive range of retinal vascular parameters. Previous studies have focused on individual parameters, for example, vessel width or fractal dimension, and on differences between software applications. 15,18 Our study extended analysis to measures of tortuosity, and we examined all parameters within a single dataset. Second, we assessed the equivalence of a range of retinal parameter associations with systemic variables across software applications, which previously has been limited to vessel width measurements. The relatively large sample of healthy older adults was taken from a birth cohort with a very restricted age range, with similar geographic background, and mostly free of illnesses known to affect the vasculature, which enabled us to examine differences between the software applications without the risk of confounding by these important variables. The use of one trained operator for all image analyses in each of the software applications reduces the potential for error due to interoperator variability. Limitations of the study also should be noted.
To the best of our knowledge, no extensive quantitative study of the stability of retinal measurements as a function of image quality (e.g., resolution, acquisition protocol, instruments) exists to date. At the current stage of software development, validation is not collaborative across research groups. Therefore, there are known differences between software applications in the formulas and methodology of retinal measurements, which are beyond the scope of discussion in the current study. The absence of OD detection and segmentation data from both software applications means that the agreement of these steps, and the effect of any differences in them, remain to be explored explicitly in further studies. Our study was limited to a comparison of two commonly used software packages. Future studies should assess agreement of measurements from a wider range of available software packages. A comprehensive analysis of agreement, performed by an international consortium, would be of great value in further clarifying current standards of agreement and in extending the discussion towards greater standardization across software applications.

Conclusion
Semiautomated retinal vasculature analysis measurements appear to be software-dependent. Based on the current results, and given the limited number of corresponding associations between retinal measures and systemic variables, we recommend caution when making inferences regarding the equivalence of associations between systemic variables and retinal measures from SIVA and VAMPIRE (and other similar software applications). It also is important to consider the variability in measurements when comparing results of retinal vascular measurements from different studies, as some of the factors that contribute to this variability arise from image acquisition settings. Improvements to the summarization or targeting of vessel width and tortuosity measurements may enhance agreement and consequently increase the efficacy of these measures in clinical settings. A collaborative retinal imaging summit of clinicians and software developers would be of immense value in progressing standardization, identifying areas in need of improvement, anticipating developments in imaging and measurement technology, developing recommendations, and facilitating consensus on best practices. Future studies examining the impact of software-specific variability in relation to normative values and clinical cut-offs would further stimulate research into the retina as a source of reliable and accurate biomarkers.