Equivalence of electronic and paper administration of patient-reported outcome measures: a systematic review and meta-analysis of studies conducted between 2007 and 2013

Objective To conduct a systematic review and meta-analysis of the equivalence between electronic and paper administration of patient-reported outcome measures (PROMs) in studies conducted subsequent to those included in Gwaltney et al.'s 2008 review. Methods A systematic literature review of PROM equivalence studies conducted between 2007 and 2013 identified 1,997 records from which 72 studies met pre-defined inclusion/exclusion criteria. PRO data from each study were extracted, in terms of both correlation coefficients (ICCs, Spearman and Pearson correlations, Kappa statistics) and mean differences (standardized by the standard deviation, SD, and the response scale range). Pooled estimates of correlation and mean difference were calculated. The modifying effects of mode of administration, year of publication, study design, time interval between administrations, mean age of participants and publication type were examined. Results Four hundred thirty-five individual correlations were extracted, these correlations being highly variable (I² = 93.8) but showing generally good equivalence, with ICCs ranging from 0.65 to 0.99 and the pooled correlation coefficient being 0.88 (95 % CI 0.87 to 0.88). Standardised mean differences for 307 estimates were small and less variable (I² = 33.5) with a pooled standardised mean difference of 0.037 (95 % CI 0.031 to 0.042). Average administration mode/platform-specific correlations from 56 studies (61 estimates) had a pooled estimate of 0.88 (95 % CI 0.86 to 0.90) and were still highly variable (I² = 92.1). Similarly, average platform-specific ICCs from 39 studies (42 estimates) had a pooled estimate of 0.90 (95 % CI 0.88 to 0.92) with an I² of 91.5. After excluding 20 studies with outlying correlation coefficients (≥3 SD from the mean), the I² was 54.4, with the equivalence still high, the overall pooled correlation coefficient being 0.88 (95 % CI 0.87 to 0.88).
Agreement was found to be greater in more recent studies (p < 0.001), in randomised studies compared with non-randomised studies (p < 0.001), in studies with a shorter interval (<1 day) (p < 0.001), and in respondents of mean age 28 to 55 compared with those either younger or older (p < 0.001). In terms of mode/platform, paper vs Interactive Voice Response System (IVRS) comparisons had the lowest pooled agreement and paper vs tablet/touch screen the highest (p < 0.001). Conclusion The present study supports the conclusion of Gwaltney's previous meta-analysis showing that PROMs administered on paper are quantitatively comparable with measures administered on an electronic device. It also confirms the ISPOR Task Force's conclusion that quantitative equivalence studies are not required for migrations with minor change only. This finding should be reassuring to investigators, regulators and sponsors using questionnaires on electronic devices after migration using best practices. Although there are data indicating that migrations with moderate changes produce equivalent instrument versions, and hence do not require quantitative equivalence studies, additional work is necessary to establish this. Furthermore, there is a need to standardize migration practices and reporting practices (i.e. include copies of tested instrument versions and screenshots) so that clear recommendations regarding equivalence testing can be made in the future.


Introduction
The implementation of electronic data capture (EDC) in clinical trial settings has become more commonplace as the use of electronic devices in everyday life has become more widespread. Tablets and smart phones are used universally across many age groups [1,2] and prior experience is not a prerequisite for their use [3]. Smart phone subscription is expected to reach 5.6 billion by 2019 [4]. The advantages of using EDC for the administration of patient-reported outcome measures (PROMs) rather than paper-and-pencil administration have been well documented; these include reduction in administrative burden, automatic implementation of skip patterns and scoring, avoidance of secondary data entry errors, time and date stamped data, and fewer items of missing data [5].
The FDA states in its Final PRO Guidance document [6] that the migration of validated paper instruments to electronic platforms should be supported with evidence: "additional validation to support the development of a modified PRO instrument" is required, including when "an instrument's data collection mode is altered", with specific reference to "paper-and-pencil self-administered PRO administered by computer or other electronic device (e.g., computer adaptive testing, interactive voice response systems, web-based questionnaire administration, computer)" (pp. 20-21). There is, however, a lack of clarity in the FDA guidance document on the type of evidence required to support PRO to ePRO migrations. As a consequence, the ISPOR ePRO Task Force, led by Stephen Coons, was established to address this issue [7]. This Task Force developed recommendations on how to demonstrate measurement equivalence between electronic and paper-based PROMs, where measurement equivalence refers to the comparability of the conceptual and psychometric properties of the data obtained via the two administration modes [7]. In this respect, the level of modification to the content and format of the paper PROM to produce an electronic version (and, increasingly, between various electronic modes) determines how comparable the two versions are and thus the evidential requirements to demonstrate equivalence between versions.
Coons et al. [7] categorised the magnitude of the modification into three levels, whereby the potential effect on the content, meaning, or interpretation of the measure's items and/or scales is assessed. If a paper-and-pencil questionnaire is simply placed into a text screen format without significantly altering item content, recall period or response options, this is considered a minor modification. Minor levels of modification also include going from multiple items per page to one item per screen, for example on a handheld device. The level of evidence required to show equivalence for a minor modification is cognitive interviewing and usability testing.
Where a modification is considered to be moderate, Coons et al. [7] suggest that the modification may result in changes to the (perceived) meaning of the assessment items. Examples of moderate changes include splitting an item into multiple screens (e.g., having a question and its responses on different screens), using a scroll bar to view all item text or responses, and changing the order of item presentation. Where such modifications are made, the level of evidence required would involve conducting quantitative equivalence testing, which evaluates the comparability between PROM scores from the electronic mode of administration and the original mode. The intent is to ensure scores do not vary significantly between modes, barring measurement error. Usability testing is also recommended, to ensure prospective participants experience no issues with the usability of the device. The most common moderate change is from a text based to an interactive voice response system (IVRS). This is considered to be a moderate change because of the difference in cognitive processes involved in responding to an item visually as opposed to aurally.
Substantial modifications occur when significant changes are made to the original assessment, such as changes to the wording or response options. Coons et al. [7] suggest that this can fundamentally change the properties of the original instrument and that the migrated instrument should be treated as a brand new instrument requiring full psychometric testing.
Prior to Coons et al.'s [7] framework being established, Gwaltney et al. [8] performed a meta-analysis of equivalence studies (excluding those conducted with IVRS) that had been conducted up until 2006, including studies directly assessing the equivalence of paper and 'computer' versions of PROMs used in clinical trials. As this meta-analysis was conducted before Coons et al.'s [7] recommendations were published, the rationale provided for conducting equivalence testing is broad. The approach that Gwaltney et al. [8] supported, and thus the basis of conducting their meta-analysis, was to provide evidence on quantitative equivalence between modes of administration.
The present study was conducted to provide further evidence on the equivalence between questionnaire scores obtained from paper administration and after migration onto one or more electronic platforms. In order to provide this evidence, a systematic review and meta-analysis was performed on equivalence studies conducted since 2007, i.e., since the conduct of Gwaltney et al.'s [8] meta-analysis. It was expected that as a consequence of recent advances in technology, the electronic platforms to which the questionnaires are migrated, such as tablets, laptops and smart phones, will be more variable, but that they will be easier to use and will not require prior experience. Ease of use of electronic devices has been shown to result in better compliance and satisfaction [9], therefore reducing potential bias even if respondents are less technologically competent. Thus we hypothesised that the meta-analysis would again show high equivalence scores for instruments migrated to a different administration mode.
Studies that had migrated a questionnaire to an IVRS were also included in the present study; these studies had been excluded from Gwaltney et al.'s 2008 analysis [8]. IVRS is frequently used in clinical research [10] and it is considered to be a more substantial change to migrate from paper to IVRS than, for example, to a tablet or smart phone [7], and so we sought to explore the equivalence of scores obtained using this platform.
The present study also explores potential publication bias in the literature. It is possible that studies which demonstrate a lack of equivalence are not submitted for publication, thus risking giving a false impression of the success of migration to and between electronic platforms.

Searching
In order to conduct a refined search in this area of literature, the papers that were included in the Gwaltney et al. [8] review were searched for the indexed terms used in three databases: Embase, Medline and PsycInfo. From this list of indexed terms, those terms that were appropriate to re-running the search were highlighted (e.g., questionnaires, microcomputers, mobile devices, crossover design). A list of terms was created under three headings: 'PROMs', 'equivalence' and 'technology'. Using appropriate Boolean operators, these terms were used in separate searches run in the three databases, with limits placed on the last 6 years (Jan 2007 to Dec 2013) and selecting human studies only.
Once the three searches had been run, the results were exported to Reference Manager to amalgamate the abstracts. The search was further refined by searching through the first 100 abstracts to identify any other relevant indexed terms. This refinement was conducted so that current terminology, which may not have shown up in Gwaltney et al. [8], could be used in the new search. After identifying additional search terms, the final search terms were produced and the searches rerun in the three databases. This search yielded 2,271 abstracts. Additional grey literature was examined by searching conference proceedings of relevant conferences (ISPOR and ISOQOL), the clinical trials registry, and by searching secondary references of articles included in the main search. A further 318 records were identified using this approach.

Inclusion criteria
A number of criteria were specified to select appropriate studies for inclusion in the review and subsequent analysis. To be included, abstracts and full-text papers/posters had to describe a study which (a) was based on the numeric equivalence of questionnaire scores and no other types of equivalence such as conceptual equivalence, (b) included two different modes of administration, (c) administered a PROM, and (d) provided a statistical result of the equivalence of two questionnaires' scores (e.g., intra-class correlation coefficient (ICC), Pearson's correlation, mean difference). The abstract review was conducted by one researcher, who then conferred with another researcher regarding the exclusion of an abstract. Full-text papers/posters were sought for abstracts meeting the criteria. If the abstract suggested that the study might be suitable, but it did not provide details of any of these four criteria, the full-text paper/poster was also sought to assess the study based on these same criteria and to decide whether or not the study should be included. Each full-text paper/poster was then reviewed once by the first researcher, and then a second time by the other researcher, to determine whether or not the study met the inclusion criteria.
The total numbers of records identified using each of the database and grey literature approaches are shown in the study PRISMA diagram (Fig. 1), along with the number of duplicates removed (n = 592); the number of articles removed after title-only analysis (n = 1502); the number of abstracts screened (n = 495) and, of these, the number excluded for one or more of the above reasons (n = 280); the number of full-text papers assessed (n = 215) and, of these, the number excluded (n = 143), again for one or more of the above reasons; and the total number of studies meeting the criteria and included in the synthesis (n = 72).
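The record counts quoted in the Searching and Inclusion criteria sections can be checked against each other with simple arithmetic. The short sketch below only re-uses the numbers reported in the text; the variable names are illustrative.

```python
# Record counts as reported in the text.
database_records = 2271      # abstracts from the three database searches
grey_records = 318           # conference proceedings, trials registry, secondary references
duplicates = 592
removed_on_title = 1502
abstracts_excluded = 280
full_texts_excluded = 143

# Each stage of the flow is the previous stage minus the records removed.
after_dedup = database_records + grey_records - duplicates
abstracts_screened = after_dedup - removed_on_title
full_texts_assessed = abstracts_screened - abstracts_excluded
studies_included = full_texts_assessed - full_texts_excluded

print(after_dedup, abstracts_screened, full_texts_assessed, studies_included)
# prints: 1997 495 215 72
```

The four values match the 1,997 records, 495 abstracts, 215 full-text papers and 72 included studies reported in the text.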

Data analysis
Data extraction
For all 72 studies that were included in the meta-analysis, the following data were extracted: (a) name and details of PROM, (b) disease area, (c) study design (parallel groups design or cross-over design), (d) the modes of administration used and details of how these were implemented, (e) mean age and standard deviation (SD) of the participants, (f) the statistic used and the result, and (g) the administration interval. Key features of each study were also identified using a modified data extraction proforma guided by the Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) statement [11]. This data extraction process also served as a critical appraisal process of each study but was not used to exclude any studies from the analysis.
Fig. 1 Flow chart showing process of identification and selection of studies for synthesis
The mean (SD) age of participants was extracted where it was presented. If it was not, then the median age was extracted, or the mean age was calculated from either the presented frequency distribution (with SD also calculated) or the average of minimum and maximum age. Data on the equivalence between the administration modes were extracted for measures of correlation and mean difference. The data on the correlation between questionnaire administrations were extracted as an ICC, Pearson or Spearman correlation coefficient, or a Kappa statistic (weighted or unweighted). Data on mean differences were extracted as a mean difference between administrations with either the presented instrument score standard deviation (SD), standard error (SE), or a 95 % confidence interval (CI), or as separate mean scores for each administration with their own SD, SE, or 95 % CI. The study-specific SD was calculated, where this was not provided, using the sample size and either the SE or 95 % CI. Since it was the magnitude of the difference between administration modes, rather than its direction, that was of primary interest, the absolute difference was used in the analysis. This approach is also conservative since it does not allow for positive and negative differences cancelling each other out [8]. Where paired data were available these were used in preference to data from the separate administration groups. Each mean difference was standardized by its extracted SD, meaning a standardized mean difference of 0.5 is a mean difference equivalent to 0.5 (half) of a standard deviation. In addition, since not all studies provided data from which the SD could be calculated, the response scale of each instrument was also extracted (e.g., an instrument scored 0 to 10 has a response scale of 11) and each mean difference standardised by the response scale. Thus, if the mean difference was 2 points on an instrument with a 100-point response scale, the standardised mean difference was scored as 2.0 %. This allowed the differences to be measured in terms of scale point difference where information on SD was not available. This was the approach used to compare mean differences by Gwaltney et al. [8].
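The two standardisations described above can be sketched as follows. This is a minimal illustration rather than the authors' extraction code: the function names are hypothetical, the SE is assumed to be the standard error of the quantity whose SD is wanted, and the 95 % CI is assumed to have been built with the normal critical value 1.96.

```python
import math

Z_95 = 1.96  # assumption: 95 % CIs were built with the normal critical value


def sd_from_se(se, n):
    """Recover an SD from a standard error and the sample size."""
    return se * math.sqrt(n)


def sd_from_ci(lower, upper, n, z=Z_95):
    """Recover an SD from a 95 % CI (normal approximation) and the sample size."""
    se = (upper - lower) / (2.0 * z)
    return sd_from_se(se, n)


def sd_standardised_difference(mean_diff, sd):
    """Absolute mean difference expressed in SD units."""
    return abs(mean_diff) / sd


def scale_standardised_difference(mean_diff, scale_points):
    """Absolute mean difference as a fraction of the response scale."""
    return abs(mean_diff) / scale_points


# Hypothetical study: difference of -2 points, SE 0.5 with n = 64, 100-point scale.
sd = sd_from_se(0.5, 64)                          # 0.5 * 8 = 4.0
print(sd_standardised_difference(-2.0, sd))       # 0.5, i.e. half an SD
print(scale_standardised_difference(-2.0, 100))   # 0.02, i.e. 2.0 %
```

Taking the absolute value first mirrors the conservative choice described in the text: differences in opposite directions cannot cancel when they are later averaged.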

Data synthesis and meta-analysis
Syntheses were conducted first over all individual measures of correlation and all mean differences calculated within each study (i.e., including multiple measures of agreement per study where these were available, such as for different scales within one instrument and different instruments). The main analyses, however, used only one (average) measure of agreement for each study: the average ICC alone; the average ICC, correlation and/or kappa coefficient in each study where multiple coefficients were presented; and the average scaled mean difference. This ensured that no one study made a disproportionate contribution to the analysis. For all analyses, however, syntheses were achieved using a weighted linear combination of the study estimates so as to give more weight to studies of larger size.
The correlation and standardised mean difference data were synthesized using both the 'metan' command within Stata v12.1 [12] and Comprehensive Meta Analysis (CMA) v2 software [13] which allows multiple types of data (e.g., mean differences) to be synthesised within the same analysis. Fisher's z transformations were applied to the correlations within both Stata and CMA. Standard meta-analytic techniques, however, could not be used for the scale-standardised mean differences, as for these no estimate of SD is provided. Instead, simple means and SDs across individual scale-standardised values were calculated to estimate the average scale-point standardised difference. These estimates were calculated over all individual standardised values and over average standardised values calculated for each study.
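As an illustration of pooling correlations through Fisher's z transformation, the sketch below computes a fixed-effect pooled correlation in plain Python. It is not the Stata or CMA implementation: the function name is hypothetical, and the weight n - 3 is the usual large-sample choice for Pearson correlations, assumed here for simplicity (ICCs and kappa statistics have different sampling variances).

```python
import math


def pool_correlations_fixed(correlations, sample_sizes, z_crit=1.96):
    """Fixed-effect pooled correlation via Fisher's z, weighting each study by n - 3."""
    weights = [n - 3 for n in sample_sizes]        # 1 / var(z) is approximately n - 3
    zs = [math.atanh(r) for r in correlations]     # Fisher's z transformation
    total_w = sum(weights)
    z_pooled = sum(w * z for w, z in zip(weights, zs)) / total_w
    se = math.sqrt(1.0 / total_w)
    lower = z_pooled - z_crit * se
    upper = z_pooled + z_crit * se
    # Back-transform the pooled value and its CI to the correlation scale.
    return math.tanh(z_pooled), math.tanh(lower), math.tanh(upper)


# Hypothetical correlations and sample sizes, not taken from the review.
r, ci_low, ci_high = pool_correlations_fixed([0.85, 0.90, 0.92], [40, 120, 60])
print(round(r, 3), round(ci_low, 3), round(ci_high, 3))
```

Pooling on the z scale and back-transforming keeps the confidence interval inside the range -1 to 1, which matters when, as here, the correlations are close to 1.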
The degree of heterogeneity between the study estimates was calculated using the I² statistic [14], a measure describing the percentage of total variation across the studies that can be explained by heterogeneity rather than chance. Values of I² lie between 0 % and 100 %, with larger values indicating greater heterogeneity; values of 25 %, 50 %, and 75 % have been proposed to indicate low, moderate, and high heterogeneity, respectively [15]. If values of I² > 75 % were identified, random effects models were used to synthesise the individual study estimates; fixed effect models were used otherwise (and to explore the effect of any potential moderating factors). Any potentially outlying studies were identified (those with an effect size more than 3.0 SDs away from the pooled effect) and the I² values and pooled effect size recalculated. In exploring the effect of potential moderating variables, fixed effects models were used, with the potential moderating variable treated as a fixed effect. Potential moderating variables considered were, where appropriate: mode of administration/platform (paper vs PC, paper vs PDA, paper vs tablet, paper vs IVRS, PC vs IVRS, tablet vs PC); year of publication (2007, 2008-2010, 2011, 2012-2013; 2007-2010 vs 2011-2013); study design (two variables: randomised cross-over, non-randomised cross-over, within-patient study (a study not formally comparing administration/platform but in which some patients completed more than one mode), parallel groups (for which only analysis of mean differences was possible); non-randomised vs randomised); time interval between administrations (<1 day, 1-5 days, 6-14 days, 15+ days; <1 day, 1-9 days, 10+ days; <1 day, 1+ days); mean age of participants (<28, 28-46, 47-55, 56+ years); sample size (≤50, 51-100, >100 participants); and publication type (abstract/poster vs full-text paper).
The modifying effect of these study characteristics on mean score differences and correlations was explored by calculation of pooled values for studies grouped by these factors (year of publication, study design, mode of administration/platform, time interval between administrations, mean age, sample size and publication type). Analyses of variance, with calculation of Q_W and Q_B statistics [15], where appropriate, were used to compare estimates between groups of studies.
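The heterogeneity rule described above can be sketched in a few lines. The sketch computes Cochran's Q and I² from study estimates and their variances, then switches to a random effects model when I² exceeds 75 %. The DerSimonian-Laird estimator of the between-study variance is an assumption made for illustration (the text does not name the estimator used by Stata or CMA), and all function names are hypothetical.

```python
def heterogeneity(effects, variances):
    """Return Cochran's Q, I-squared (in %) and the DerSimonian-Laird tau-squared."""
    weights = [1.0 / v for v in variances]               # inverse-variance weights
    pooled = sum(w * y for w, y in zip(weights, effects)) / sum(weights)
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, effects))
    df = len(effects) - 1
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0
    c = sum(weights) - sum(w * w for w in weights) / sum(weights)
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0      # assumed DL estimator
    return q, i2, tau2


def pooled_estimate(effects, variances, threshold=75.0):
    """Pool with random effects if I-squared exceeds the threshold, fixed effect otherwise."""
    q, i2, tau2 = heterogeneity(effects, variances)
    if i2 > threshold:
        weights = [1.0 / (v + tau2) for v in variances]  # add between-study variance
        model = "random"
    else:
        weights = [1.0 / v for v in variances]
        model = "fixed"
    pooled = sum(w * y for w, y in zip(weights, effects)) / sum(weights)
    return pooled, model, i2


# Hypothetical example: two estimates with equal variance 0.5.
print(heterogeneity([0.0, 2.0], [0.5, 0.5]))       # (4.0, 75.0, 1.5)
print(pooled_estimate([0.0, 3.0], [0.5, 0.5])[1])  # random
```

In the first call Q = 4 on 1 degree of freedom, so I² = (4 - 1) / 4 = 75 %, which sits exactly on the threshold and therefore still uses the fixed effect model.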
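The comparison between groups of studies rests on partitioning the total heterogeneity statistic into a within-group part (Q_W) and a between-group part (Q_B). The sketch below shows this partition under a fixed effect model; the function names and the example values are hypothetical, and the p-value step (comparing Q_B with a chi-square distribution on one fewer degrees of freedom than the number of groups) is left out to keep the example free of third-party dependencies.

```python
def weighted_q(effects, variances):
    """Return (Q, pooled estimate, total weight) under a fixed effect model."""
    weights = [1.0 / v for v in variances]
    total_w = sum(weights)
    pooled = sum(w * y for w, y in zip(weights, effects)) / total_w
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, effects))
    return q, pooled, total_w


def partition_q(groups):
    """groups maps a subgroup label to (effects, variances).

    Returns (Q_W, Q_B): heterogeneity within and between subgroups.
    """
    q_within = 0.0
    group_means, group_vars = [], []
    for effects, variances in groups.values():
        q, pooled, total_w = weighted_q(effects, variances)
        q_within += q
        group_means.append(pooled)
        group_vars.append(1.0 / total_w)   # variance of the subgroup's pooled estimate
    q_between, _, _ = weighted_q(group_means, group_vars)
    return q_within, q_between


# Hypothetical standardised mean differences grouped by study design.
groups = {
    "randomised": ([0.03, 0.04], [0.0001, 0.0001]),
    "non-randomised": ([0.06, 0.07], [0.0004, 0.0004]),
}
q_w, q_b = partition_q(groups)
print(round(q_w, 3), round(q_b, 3))
```

Under the fixed effect model Q_W and Q_B add up to the Q statistic computed over all estimates together, so a large Q_B relative to its degrees of freedom indicates that the grouping variable explains part of the heterogeneity.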
The likelihood of publication bias was assessed using funnel plots, along with Duval and Tweedie's Trim and Fill to estimate the likely number of missing studies (under both fixed and random effects models) and to provide estimates of the overall effect after including any identified missing studies. Orwin's fail-safe N was also used, as in Gwaltney et al. [8], to estimate the number of missing studies required to bring the observed correlation below 0.75, taking the average correlation of the missing studies to be the lowest observed individual study correlation.
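Orwin's fail-safe N has a simple closed form: with k observed studies averaging an effect E_obs, a criterion value E_crit and missing studies assumed to average E_miss, the number of missing studies needed is k (E_obs - E_crit) / (E_crit - E_miss). The sketch below applies this formula directly to whatever values are passed in; the function name is hypothetical, and whether the review's software applied the formula to raw correlations or to Fisher's z values is not stated in the text, so the example uses made-up numbers only.

```python
def orwin_failsafe_n(k, observed, criterion, missing_mean):
    """Number of missing studies averaging `missing_mean` needed to pull the
    mean effect of `k` studies from `observed` down to `criterion`."""
    if not missing_mean < criterion < observed:
        raise ValueError("expected missing_mean < criterion < observed")
    return k * (observed - criterion) / (criterion - missing_mean)


# Hypothetical example: 10 studies averaging 0.90, criterion 0.75,
# missing studies assumed to average 0.60.
n_missing = orwin_failsafe_n(10, 0.90, 0.75, 0.60)
print(round(n_missing, 1))   # 10 * 0.15 / 0.15, i.e. about 10 studies
```

The closer the assumed mean of the missing studies is to the criterion, the larger the fail-safe N, which is why the choice of that value (here, the lowest observed correlation) needs to be reported.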

Study characteristics
Characteristics of all 72 studies meeting the inclusion criteria and included in the meta-analysis are listed in Table 1. Data for four of these studies were available from conference posters and five from abstracts; the remainder were from full-text publications. The number of PRO instruments assessed within each study ranged from one to ten, with the number of individual analyses within each study ranging from one to 60. These instruments included generic measures such as the Short Form 36 Health Survey (SF-36) and condition-specific measures such as the Rhinoconjunctivitis Quality of Life Questionnaire (RQLQ); for a full list of the instruments included see Table 1. Studies were conducted in over 23 different population types, with the most frequent population being mental health (n = 15 studies). The studies included data collected from four different electronic platforms [PC, handheld (PDA/smartphone), IVRS, tablet/touch screen], the most commonly used platform being PC (used in n = 43 studies), followed by PDA (n = 14 studies), tablet/touch screen (n = 8 studies) and IVRS (n = 7 studies). The average age of the participants in the studies ranged from 9.58 to 68.3 years, with an overall mean age of 42.9 (SD 17.1) years.

Overall relationship between paper and electronic assessments
Mean differences
There were 307 individual estimates of group mean difference (independent group differences or, in preference, paired differences) either with a standard deviation (SD) or with data from which a standard deviation could be calculated. These estimates had low variability with an I² of 33.47; the fixed effects pooled estimate of absolute mean difference was 0.037 (95 % CI 0.031 to 0.042). There were 355 individual estimates of group mean differences which could be standardised by the scale score. The mean scale-standardised difference was 0.0180 scale points, i.e., 1.80 % of the score range (range = 0.00 to 0.13, i.e., 0 to 13 %; SD = 0.021), with the upper bound of the 95 % CI (0.015 to 0.020) indicating that the difference in absolute scores between platforms is likely to be at most 2.0 %. The mean difference was within 5 % of the scale score in 93 % of estimates. For the scale-standardised scores averaged over 54 studies with data on mean differences, the mean scale-standardised difference was slightly smaller at 0.0167 scale units (range = 0.001 to 0.058; SD = 0.012), with 95 % CI 0.013 to 0.020. Two of these studies [33,72] had data on different platform comparisons, giving 57 mean differences by study and platform in total (platform-specific comparisons), with a mean of 0.0163 (range 0.001 to 0.057; SD = 0.012), with 95 % CI 0.013 to 0.019, and 97 % having a value within 5 % of the scale score.
Correlations
A total of 435 individual correlations were extracted from all 72 studies, these being highly variable, with an I² of 93.75 %. The random effects pooled correlation coefficient was 0.875 (95 % CI 0.867 to 0.884). Correlations averaged over the values in each of 56 studies with available data (one study providing values for two different platform comparisons [33] and two studies three different comparisons [20,72]; i.e., 61 platform-specific values in total) are shown in Fig. 2.

Analysis of moderator variables
Mean differences
In terms of factors which might explain the observed heterogeneity, for the 307 individual standardised mean differences (data shown in Table 2) [...] Values from studies comparing paper with tablet devices appeared to have the greatest level of agreement (p < 0.001).
In terms of study design, agreement was greater in the 256 values from randomised studies compared with the 51 values from non-randomised studies and in cross-over studies compared with within-patient and parallel group studies (p < 0.001). Studies with a longer interval between administrations and 56 or fewer participants had lower levels of agreement (p = 0.077). In terms of participant age, the 84 values from studies with a mean age of <28 years had the lowest agreement, and the 40 values from studies with a mean age of 28-46 years the greatest (p < 0.001). There was no significant association with publication type (Table 2).
Using the 57 scale-standardised mean differences averaged across each study and platform, mean(SE) differences were significantly lower (i.e., agreement greater) in the 25

Correlations
Using the 61 correlations averaged across each study and platform (data shown in Table 3), there was a difference in pooled correlation estimates between studies grouped by publication year, with agreement in earlier years, particularly in 2007, being lower (fixed effects p < 0.001). The design of the studies was also significantly associated with the degree of correlation, with the highest agreement being observed in randomised studies and the lowest in non-randomised studies (p < 0.001). In terms of platform, 8 studies compared a paper with an IVRS measure, 34 a paper with a PC measure, 10 a paper with a PDA measure, and 7 a paper with a tablet/touch screen measure. The paper vs IVRS comparisons had the lowest pooled agreement and the paper vs tablet the highest. In terms of the time period between administrations, agreement decreased as the time interval increased (p < 0.001). The age of the participants also had a significant association with agreement, with the youngest participants (those aged <28 years on average) having the lowest agreement but other age groups generally having comparable levels of agreement. While study size had no significant association with agreement, there was a significant association with publication type, with data extracted from 51 full-text publications having lower levels of agreement than data extracted from 10 abstracts/posters (p < 0.001). Relationships assessed using all available 435 correlations were similar, although the association with sample size, with smaller studies having greater agreement, was statistically significant (Table 3). Using an average correlation of 0.65 for potentially missing studies, this being the lowest ICC extracted [77], Orwin's fail-safe N test estimated that 123 missing studies additional to the 61 (79 for the 42 estimates after excluding the outliers) would be needed to bring the observed pooled estimate to <0.75.

Discussion
The results summarised here indicate that electronic and paper PROMs and different modes of electronic administration produce equivalent scores across a wide range of scenarios (medical conditions and platforms), suggesting that electronic measures can generally be assumed to be equivalent to pen and paper measures. In particular, given the generally high level of agreement across all studies included in this review, there is no evidence that equivalence is compromised by the nature of the condition under investigation, even when the information collected is of a sensitive nature, such as of sexual function [34], sexual health [21,22], sexual behaviour [45], IBS [50] and IBD [52]. Further analyses exploring the role of measurement domain (e.g. physical or mental health) will be reported in another paper. Of particular note is the fact that, based on the ICCs and the numerically small mean score differences, pen-and-paper scores are equivalent to scores obtained from a variety of electronic platforms: IVRS, handheld, PC, and tablet. While equivalence between paper and IVRS measures appears to be slightly lower than with most other forms of electronic measure (pooled correlation coefficient 0.85 vs 0.89 for paper vs tablet; pooled standardised mean difference 0.053 vs 0.020), the data suggest that the likely true agreement (lower 95 % CI) between paper and IVRS measures is at least 0.82 and thus that there is at least good agreement between data obtained from IVRS and pen and paper measures. This is reassuring given that migration from paper to an IVRS is considered to be a moderate change because of the difference in cognitive processes involved in responding to an item aurally as opposed to visually.
Table footnotes: (a) [...] [20] were randomly assigned to complete 2 versions of 1 of 4 instruments. (b) Four studies [7,23,48,71] did not provide information on the age of their participants.
These results are also consistent with the results from a recent large study (N = 923 adult participants) of the effects of method of administration (paper, PDA, PC, IVRS) on the measurement characteristics of items developed in the Patient-Reported Outcomes Measurement Information System (PROMIS) which strongly supported measurement equivalence across all platforms [88].
The observed mean differences in PROM scores between administration types were small. Taking all mean differences as positive differences, the fixed effects pooled standardized mean difference (mean difference standardised by the SD) of the 307 estimates was 0.037 (95 % CI 0.03 to 0.04). These estimates were also of low variability, with an I² of 33.5. In other words, the average mean difference in scores between electronic and paper measures was small at approximately 0.04 SDs. No comparison with earlier data is possible as Gwaltney et al. [8] did not report on standardised mean differences. Standardising the mean differences by the scale range (rather than the score SD), this difference was equivalent to a scale-standardised mean score difference of 1.8 % or, from the upper bound of the 95 % CI, a difference of at most 2 %. This is consistent with, or slightly smaller than, the 2 % mean scale-standardised difference reported by Gwaltney et al. [8]. Similarly, 93 % of all mean differences in this study were within 5 % of the scale score, exactly the same percentage as reported by Gwaltney et al. [8]. The values were similar when study and platform averaged scale-standardised estimates were used: the 57 values had a mean of 0.0163, with 95 % CI 0.013 to 0.019, and 97 % having a value within 5 % of the scale score.
Fig. 3 Assessment of publication bias among correlation coefficients averaged over study/platform under a random effects model
In terms of ICCs and correlation coefficients, agreement was again high, with a pooled ICC over 42 study-specific estimates of 0. [...] [8], an estimate which was the same irrespective of the specific measure of correlation. There is thus little evidence from either the present study or the earlier one [8] that the measure of correlation used has any influence on the degree of equivalence obtained. This is reassuring given the number of studies not employing the ICC in their assessment of equivalence. The ICC is the statistically correct measure of equivalence when agreement is assessed within (i.e., intra) measures sharing the same metric (i.e., mean and standard deviation); the Pearson correlation (an interclass correlation) is appropriate only when the measures are of a different class and do not share the same metric [89]. It is also worth noting that not all studies identified in this review that employed the ICC stated which of the six possible ICCs, as described by Shrout and Fleiss in 1979 [90], was employed: whether the model was one-way or two-way, random or mixed, applying to single or average measures, or measuring consistency or absolute agreement. The value of the ICC obtained will depend on the specific model chosen. A full description of the nature of different ICCs is provided by McGraw and Wong, 1996 [89].
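To illustrate why the choice of ICC model matters, the sketch below computes two single-measure ICCs from the same paired data using the mean squares of a two-way layout: ICC(3,1), which measures consistency, and ICC(2,1), which measures absolute agreement, following the Shrout and Fleiss definitions. The data and function names are hypothetical and the example is not drawn from any study in the review.

```python
def two_way_mean_squares(scores):
    """scores: one row per participant, one column per administration mode."""
    n, k = len(scores), len(scores[0])
    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_error = ss_total - ss_rows - ss_cols
    bms = ss_rows / (n - 1)                  # between-participants mean square
    jms = ss_cols / (k - 1)                  # between-modes mean square
    ems = ss_error / ((n - 1) * (k - 1))     # residual mean square
    return n, k, bms, jms, ems


def icc_consistency(scores):
    """ICC(3,1): two-way mixed, single measure, consistency."""
    n, k, bms, jms, ems = two_way_mean_squares(scores)
    return (bms - ems) / (bms + (k - 1) * ems)


def icc_absolute_agreement(scores):
    """ICC(2,1): two-way random, single measure, absolute agreement."""
    n, k, bms, jms, ems = two_way_mean_squares(scores)
    return (bms - ems) / (bms + (k - 1) * ems + k * (jms - ems) / n)


# Hypothetical data: the electronic score is always one point above the paper score.
scores = [[1, 2], [2, 3], [3, 4], [4, 5], [5, 6]]
print(round(icc_consistency(scores), 3))          # 1.0
print(round(icc_absolute_agreement(scores), 3))   # 0.833
```

A constant one-point shift between modes leaves the consistency ICC at 1.0 but lowers the absolute agreement ICC to 5/6, so two studies reporting "the ICC" for identical data could report different values depending on the model chosen.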
The correlation estimates were highly variable in both the current study and Gwaltney et al. [8], with the I² in the current study being >90 %. After excluding outliers, however, the pooled estimates were essentially unchanged. In terms of factors which might explain the observed heterogeneity, agreement was greater in studies reported most recently (2011-2013 vs 2007-2010), in randomised as opposed to non-randomised studies, in studies with an interval between administrations of <1 day (and, overall, the greater the interval the lesser the agreement), and in studies of larger size. In addition, studies including very young children were associated with lower levels of agreement. While these associations were generally of high statistical significance (p < 0.001), they were small in magnitude, indicating that these factors have only small, albeit precisely estimated, effects; agreement is generally high even in those studies with the lowest agreement. Nevertheless, the patterns observed highlight the importance of appropriate study design when assessing equivalence: randomised studies and those with a shorter interval between administrations were associated with greater equivalence, an effect that was greatest in studies with an interval of fewer than 10 days between administrations. The lower levels of agreement observed in younger individuals (<28 years) may to some extent reflect this effect: four [45,54,68,78] of the five studies [30,45,54,68,78] conducted in younger individuals with the lowest level of agreement (ICC < 0.80) had intervals between administrations of one week or more.
The same was true of mean differences: average scale-standardised mean differences were lower (agreement higher) in more recent years (2011-2013) compared with earlier years (2007-2010), and randomised studies were associated with greater agreement than non-randomised studies, with the pooled standardised mean difference being 0.035 (95 % CI 0.030 to 0.041) vs 0.065 (95 % CI 0.046 to 0.084), p = 0.003. Other design features associated with agreement were the interval between administrations, with agreement being better (mean difference lower and correlation higher) in studies with an interval of <1 day; and mean age of participants, with agreement being better in studies with participants of mean age between 28 and 55 years. Studies in either younger (some studies having participants of mean age <13) or older participants tended to have lower levels of agreement, which is consistent with lower levels of familiarity with EDC platforms in the older age group, and perhaps some unreliability in the responses in general from very young children. By definition, correlation coefficients cannot be obtained from parallel group studies; for the 7 estimates from parallel group studies the scale-standardised mean difference was 1.83 % compared with 1.55 % for the 35 estimates from randomised cross-over designs.
Gwaltney et al. [8] also found substantial heterogeneity in their extracted estimates of equivalence and were unable to explain the variability with analysis of the moderating factors (age and computer familiarity). Nevertheless, only 9 of the studies in the present analysis reported a correlation of less than 0.80. Furthermore, this study found little evidence of publication bias; no studies with correlations less than the pooled mean were identified as missing. The identification of 11 possible missing studies with correlations greater than the pooled mean may simply be a reflection of heterogeneity in the data. Finally, as many as 123 studies with a correlation of <0.75 would need to have been conducted and not published in order for the overall effect to have been <0.75. This figure of 123 was greater than the 95 studies similarly estimated by Gwaltney et al. [8], suggesting that the more recent studies are more robust than those identified in the earlier review. There is thus no reason to believe that heterogeneity, and any possible publication bias, should temper the conclusions drawn from this meta-analysis.
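The logic behind a "number of unpublished studies needed" figure can be sketched as follows. The exact procedure used to obtain the figure of 123 is not described in this section, so the sketch assumes an unweighted, Orwin-style fail-safe N, in which the missing studies are given a hypothetical average correlation below the criterion. The function name and all input values are illustrative and are not intended to reproduce the reported figure.

```python
def orwin_fail_safe_n(k, mean_observed, criterion, mean_missing):
    """Number of missing studies that would pull the unweighted mean to the criterion.

    k             -- number of observed estimates
    mean_observed -- unweighted mean of the observed correlations
    criterion     -- threshold of interest (e.g., 0.75)
    mean_missing  -- assumed mean correlation of the unpublished studies
    """
    if not mean_missing < criterion < mean_observed:
        raise ValueError("requires mean_missing < criterion < mean_observed")
    # Solve (k * mean_observed + N * mean_missing) / (k + N) = criterion for N
    return k * (mean_observed - criterion) / (criterion - mean_missing)


if __name__ == "__main__":
    # Hypothetical inputs: 10 estimates averaging 0.90, criterion 0.75,
    # unpublished studies assumed to average 0.60
    n_missing = orwin_fail_safe_n(10, 0.90, 0.75, 0.60)
    print(n_missing)  # approx. 10 additional studies
```

The result is sensitive to the assumed mean of the missing studies: the closer that assumed mean lies to the criterion, the larger the number of unpublished studies required, so any such figure should be read together with its assumptions.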
In terms of study design, the general critical appraisal process of each study identified some issues which should be taken into account in future studies. For example, only a small proportion of studies (n = 18, 25 %) reported on the use of a power calculation when planning the study size and fewer than half used 95 % CIs (n = 29, 40 %) in result reporting. These issues relate to the importance of ensuring that the study is large enough to estimate the equivalence effect with sufficient precision for a possible lack of equivalence to be ruled out (i.e., by the 95 % CI excluding all values indicating measurement non-equivalence). Similarly, while it is encouraging to note that parallel studies assessing measurement equivalence are becoming less frequent (of the 7 parallel group studies, 4 (57 %) were reported in the two years from 2007 to 2008, see Table 1), and while the majority of studies identified (n = 51, 70.8 %) were randomised cross-over studies, in which participants completed both versions of the PROM in randomly allocated order, only 8 of these [20,51,54,59-62,72] undertook the equivalence assessment in the context of a full, or almost full, factorial assessment of instrument equivalence. Such full assessment requires the comparison of scores among four groups of respondents: those completing electronic first and then paper (E-P), those completing paper first and then electronic (P-E), those completing two paper versions (P-P), and those completing two electronic versions (E-E).
Such assessment, with appropriate statistical analysis, allows the expected variability in scores between measures completed on the same platform on two occasions (i.e., test-retest reliability) to be 'subtracted', in the context of an analysis of variance, from the variability observed between measures completed on different platforms (although the formal statistical analysis of these 8 studies generally did not capitalise on the study design). This reflects the fact that it is clearly nonsensical to require a greater degree of measurement equivalence between measures on different platforms than is required between one measure on the same platform in the context of the assessment of test-retest reliability: at best the same degree of equivalence should be required.
Such considerations also raise questions about the inherent expectation of equivalence built into such studies. With the documented strengths of electronic modes of administration over paper [5], one might rightly anticipate a quantitative difference in the data captured on different modes of the same questionnaire due to the simple fact that better quality data are captured on the electronic system (e.g., fewer items of missing data, no out-of-range data). The current approach to equivalence studies seems to demand comparability between superior (electronic) and inferior (paper) modes of data capture, which risks undermining the true advantages EDC brings to an actual clinical trial over the necessarily artificial setting of the equivalence study.

Conclusion
The present study strongly supports the conclusion of Gwaltney et al. [8] that PROM data obtained from electronic platforms are comparable to those obtained from paper administration, as well as providing data on the equivalence of PROMs migrated to an IVRS platform, data not included in the earlier Gwaltney et al. study [8]. The high level of agreement seen in this review as well as in the Gwaltney et al. review [8] should be reassuring to investigators, authorities and sponsors using electronic devices to collect PROM data, having implications for the use of electronic measures generally and in clinical trials in particular.
Given the weight of the evidence for the equivalence between paper and electronic versions, we propose that equivalence studies should not be necessary to demonstrate the equivalence of validity of a measure that has been migrated to an electronic platform following best practices [7] with minor changes as defined in the ISPOR Taskforce report [7]. These results also suggest that a migration following best practices [7] to an IVRS may not need an equivalence study. Further research into migration principles and standards for IVRS may be needed to support our findings.
This conclusion stands even when estimates of possible unpublished studies are included in our analysis, highlighting the robust nature of instruments migrated from paper onto electronic platforms. We further recommend that common best practices are established among the vendor community (e.g., via the ePRO consortium) to standardize migration principles (e.g., number of items per screen, scrolling through answer options) as well as to define a standard framework for the conduct and publication of equivalence studies.