Beyond the Randomized Controlled Trial: A Review of Alternatives in mHealth Clinical Trial Methods

Background: Randomized controlled trials (RCTs) have long been considered the primary research study design capable of eliciting causal relationships between health interventions and consequent outcomes. However, with a prolonged duration from recruitment to publication, high-cost trial implementation, and a rigid trial protocol, RCTs are perceived as an impractical evaluation methodology for most mHealth apps. Objective: Given the recent development of alternative evaluation methodologies and tools to automate mHealth research, we sought to determine the breadth of these methods and the extent to which they were being used in clinical trials. Methods: We conducted a review of the ClinicalTrials.gov registry to identify and examine current clinical trials involving mHealth apps and retrieved relevant trials registered between November 2014 and November 2015. Results: Of the 137 trials identified, 71 were found to meet inclusion criteria. The majority used a randomized controlled trial design (80%, 57/71). Study designs included 36 two-group pretest-posttest control group comparisons (51%, 36/71), 16 posttest-only control group comparisons (23%, 16/71), 7 one-group pretest-posttest designs (10%, 7/71), 2 one-shot case study designs (3%, 2/71), and 2 static-group comparisons (3%, 2/71). A total of 17 trials included a qualitative component to their methodology (24%, 17/71). Complete trial data collection required 20 months on average (mean 21, SD 12). For trials with a total duration of 2 years or more (31%, 22/71), the average time from recruitment to complete data collection (mean 35 months, SD 10) was 2 years longer than the average time required to collect primary data (mean 11, SD 8). Trials had a moderate sample size of 112 participants. Two trials were conducted online (3%, 2/71) and 7 trials collected data continuously (10%, 7/68). Onsite study implementation was heavily favored (97%, 69/71).
Trials with four data collection points had a longer study duration than trials with two data collection points: F(4,56)=3.2, P=.021, η²=0.18. Single-blinded trials had a longer data collection period compared to open trials: F(2,58)=3.8, P=.028, η²=0.12. Academic sponsorship was the most common form of trial funding (73%, 52/71). Trials with academic sponsorship had a longer study duration compared to industry sponsorship: F(2,61)=3.7, P=.030, η²=0.11. Combined, data collection frequency, study masking, sample size, and study sponsorship accounted for 32.6% of the variance in study duration: F(4,55)=6.6, P<.01, adjusted r²=.33. Only 7 trials had been completed at the time this retrospective review was conducted (10%, 7/71).
Pham et al. JMIR Mhealth Uhealth 2016, vol. 4, iss. 3, e107. http://mhealth.jmir.org/2016/3/e107/


Introduction
With over 165,000 mobile health (mHealth) apps on the Apple App Store and Google Play Store catalogues and 3 billion downloads in 2015 alone [1], mHealth apps represent a mature, robust marketplace for a new generation of patients who seek patient-empowered care and mHealth publishers who aim to facilitate this practice. mHealth apps are currently being developed for many different clinical conditions including diabetes [2], heart failure [3], and cancer [4], and have the potential to disrupt existing health care delivery pathways.
In recent years, numerous calls have been made to address the challenges inherent in mHealth app evaluation [5][6][7]. Key barriers were identified by researchers at the National Institutes of Health mHealth Evidence Workshop, notably the difficulty of matching the rapid pace of mHealth innovation with existing research designs [8]. Explicit attention was drawn to the randomized controlled trial (RCT), which has long been considered the primary research study design capable of eliciting causal relationships between health interventions and consequent outcomes [9]. However, RCTs are notoriously long: the average duration of 5.5 years from enrollment to publication clearly risks app obsolescence occurring before study completion [10]. With high-cost trial implementation and a rigid protocol that precludes mid-trial changes to the intervention in order to maintain internal validity, RCTs are perceived as an incompatible, impractical evaluation methodology for most mHealth apps [11][12][13][14][15]. There is also an inherent quality of software that does not lend itself to the rigidity of the RCT: software is meant to change, evolve, progress, and learn over time, all at a rapid pace. Rigid trial protocols undermine this principal attribute, since controlled trials were designed for interventions that take years, even decades, to develop, that is, medical devices and drugs. In concluding the mHealth Evidence Workshop, researchers identified the need to develop novel research designs that can keep up with the lean, iterative, and rapid-paced mHealth apps they seek to evaluate.
The Chicago-based Center for Behavioral Intervention Technologies has endeavored to design methodological frameworks that can appropriately support mHealth evaluation. Mohr and colleagues proposed the Continuous Evaluation of Evolving Behavioral Intervention Technologies (CEEBIT) framework as an alternative to the gold-standard RCT [16]. The CEEBIT methodology is statistically powered to continuously evaluate app efficacy throughout trial duration and accounts for changing app versions through a sophisticated elimination process. The CEEBIT also thoughtfully addresses many other RCT-specific considerations, from randomization to inclusion/exclusion criteria to statistical analysis.
Additional alternatives to the RCT have also been presented, including interrupted time-series, stepped-wedge, regression discontinuity, and N-of-1 trial designs that may limit internal validity but are more responsive and relevant for evaluating mHealth interventions [8]. Novel factorial trial designs have been proposed for mHealth research and are increasingly being used to test multiple app features and determine the optimal combinations and adaptations to build an effective app. These include the multiphase optimization strategy (MOST) [17], the sequential multiple assignment randomized trial (SMART) [18], and the microrandomized trial [19]. Suggestions have also been made on how to increase the efficiency of traditional RCTs themselves, including using within-group designs, fully automating study enrollment, random assignment, intervention delivery and outcomes assessment, and shortening follow-up through modeling long-term outcomes [13]. Further, best practice evaluation methods in the field of human-computer interaction, notably usability testing and heuristic evaluation, have been widely adopted in mHealth research and are well suited to assess the efficacy of user-driven, digitally operationalized behavioral mechanisms required to elicit stable changes in health outcomes [20][21][22]. These alternatives allow us to reconsider the RCT in favor of a more flexible and iterative evaluation approach that will mimic the attributes of software-based behavioral interventions and their agile app development process, where it is acceptable and preferable to learn from a poor trial outcome sooner in order to redesign the intervention more quickly and subsequently show success sooner.
In parallel to the development of novel research designs like the CEEBIT, new industry initiatives have also introduced novel platforms to deploy mHealth evaluations. In 2015, Apple announced the release of ResearchKit, a software framework designed for health research to allow iPhone users to participate in research studies more easily [23]. ResearchKit allows for the digital collection of informed consent, a process that has historically hindered the accrual of patients into trials and the scalability of clinical research. It also enables access to real-time data collected from the iPhone's accelerometer, gyroscope, microphone, and global positioning system (GPS), along with health data from external wearables (eg, FitBit, Apple Watch) to gain real-time insight into a participant's health behaviors [24]. Evidence of ResearchKit's impact can already be seen in several Apple-promoted research trials deployed for a range of conditions [25][26][27]. It is not difficult to imagine ResearchKit being adapted for use as a tool to evaluate mHealth app efficacy: an app claiming to help patients self-manage their diabetes could be launched using the ResearchKit framework, with its efficacy evidenced through sensor data and in-app surveys.
Given the development of alternative evaluation methodologies and the launch of novel technologies to automate mHealth research, we sought to determine if these initiatives were being implemented in current clinical trials. Through this review, research designs and methods for current mHealth clinical trials were identified and characterized in an effort to understand the views of the field toward novel frameworks for evaluating mHealth apps.

Methods
A review of the ClinicalTrials.gov registry was conducted in November 2015 to identify and examine current clinical trials involving mHealth apps. The following search terms were trialled in a scoping search to optimize the search strategy: mobile application, mobile heath app, mobile health application, mobile app, smartphone application, and smartphone app. A Boolean search was then conducted with all these search terms combined ("mobile application" OR "mobile heath app" OR "mobile health application" OR "mobile app" OR "smartphone application" OR "smartphone app"). However, upon comparing the search results generated from all scoping searches, the search term "mobile application" independently yielded a higher number of results compared to the Boolean search. A precautionary decision was made to use "mobile application" as the sole search term to retrieve relevant trials registered between November 19, 2014, and November 19, 2015, a 1-year period before this review was initiated. The titles and abstracts of retrieved trials were assessed for inclusion, followed by a complete review of the entire trial registration. Following the final identification of trials to include in our review, we conducted a reverse search of each trial to determine whether it would have been found through our initial Boolean search and concluded that a small number of relevant studies would have been omitted. We therefore recommend the use of "mobile application" as the preferred comprehensive search term for those looking to duplicate our search strategy.
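As a rough illustration of the search set-up described above, the following Python sketch builds the combined Boolean string from the listed terms and checks whether a registration date falls inside the 1-year review window. It does not query the registry; the function names are hypothetical, and treating both window endpoints as inclusive is an assumption.

```python
from datetime import date

# Search terms as listed in the Methods; spelling kept exactly as reported.
SEARCH_TERMS = [
    "mobile application",
    "mobile heath app",
    "mobile health application",
    "mobile app",
    "smartphone application",
    "smartphone app",
]

# Review window for trial registration (endpoints assumed inclusive).
WINDOW_START = date(2014, 11, 19)
WINDOW_END = date(2015, 11, 19)


def boolean_query(terms):
    """Join quoted terms with OR, mirroring the combined scoping search."""
    return " OR ".join(f'"{term}"' for term in terms)


def in_window(registered_on):
    """Return True if a registration date falls inside the review window."""
    return WINDOW_START <= registered_on <= WINDOW_END
```

Calling `boolean_query(SEARCH_TERMS)` yields the combined string used in the scoping search, while the final strategy reported above corresponds to the single term `"mobile application"`.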
All trials were included if they (1) evaluated mHealth apps, (2) measured clinical outcomes, and (3) were deployed exclusively on a mobile phone as a native app and not a Web-based app.
Trials were excluded if (1) they evaluated mHealth apps that solely received text messages (short message service [SMS] or multimedia messaging; this was done due to a large amount of existing trials for SMS-based interventions in the literature) or phone calls as their primary behavior change modification, (2) the mHealth app was a secondary intervention or the study mixed mobile and non-mobile interventions, (3) the mHealth app was solely an appointment reminder service, and (4) the mHealth app did not require user input through active or passive (sensor) data entry.
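The inclusion and exclusion criteria above can be read as a single screening rule. The Python sketch below is one hedged way to express it; the record type and every field name are hypothetical stand-ins for judgments that reviewers made manually from each trial registration.

```python
from dataclasses import dataclass


@dataclass
class TrialRecord:
    # All field names are hypothetical; each mirrors one stated criterion.
    evaluates_mhealth_app: bool
    measures_clinical_outcomes: bool
    native_mobile_app: bool          # native phone app, not a Web-based app
    sms_or_calls_only: bool          # app solely relies on SMS/MMS or phone calls
    app_is_secondary_or_mixed: bool  # secondary intervention or mixed with non-mobile
    reminder_only: bool              # solely an appointment reminder service
    requires_user_input: bool        # active or passive (sensor) data entry


def meets_criteria(trial: TrialRecord) -> bool:
    """Apply all three inclusion criteria, then the four exclusion criteria."""
    included = (
        trial.evaluates_mhealth_app
        and trial.measures_clinical_outcomes
        and trial.native_mobile_app
    )
    excluded = (
        trial.sms_or_calls_only
        or trial.app_is_secondary_or_mixed
        or trial.reminder_only
        or not trial.requires_user_input
    )
    return included and not excluded
```

A trial passes only when every inclusion criterion holds and no exclusion criterion applies, so a single failed check is enough to screen it out.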
Following the identification of studies that met inclusion criteria, trial data were extracted from the ClinicalTrials.gov website and coded according to relevant outcome variables. All data were collected directly from the registry, where trial information was originally reported and categorized by the investigators conducting the trials. Extracted data measures included trial identification, app name, study purpose, trial sponsor, targeted condition, data collection duration, data collection points, study duration, sample size, study type, control and masking methods, random allocation, group assignment, study site, qualitative components, app availability, and study design. Table 1 lists all measures that were manually coded into categories from extracted data alongside their codes. A differentiation was made in coding "data collection duration," defined as the amount of time allotted for primary data collection as specified in the outcome measures section of each ClinicalTrials.gov record detail, and "study duration," defined as the amount of time between initial recruitment and complete data collection as specified by the "estimated study completion date" in the trial record detail. Studies were coded as being onsite if participants had any direct face-to-face contact with a member of the research team, and online if recruitment and follow-up data collection were done remotely; if a participant was recruited in a hospital setting but follow-up data were collected through the study app, this was coded as onsite implementation. Targeted conditions were further coded into parent condition categories for analysis. All identified app titles were also searched on public app stores (ie, Apple App Store, Google Play Store) to confirm whether they were available for public download.
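Two of the coding rules above lend themselves to a short sketch: deriving study duration from registry dates, and coding the study site. The Python below is illustrative only. The function names are hypothetical, and counting whole calendar months while ignoring the day of the month is an assumption, since the rounding rule is not stated.

```python
from datetime import date


def months_between(start: date, end: date) -> int:
    """Whole calendar months from start to end, ignoring day of month."""
    return (end.year - start.year) * 12 + (end.month - start.month)


def code_site(recruited_face_to_face: bool, followup_face_to_face: bool) -> str:
    """Code a trial as onsite if any contact was face-to-face, else online."""
    if recruited_face_to_face or followup_face_to_face:
        return "onsite"
    return "online"
```

Under this rule, a trial that recruits in a hospital but collects follow-up data through the app, that is, `code_site(True, False)`, is coded as onsite, matching the example given in the text.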

Data Analysis
Descriptive statistics were first conducted on all variables to identify methodological data trends and parameters. In reference to Campbell and Stanley's experimental and quasi-experimental designs for research [28], measures of whether trials collected pretest or baseline data, and also the number of data collection points throughout the trial, were recorded. This was done to identify specific study designs and assess the range of study designs deemed suitable for mHealth app evaluation.
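To make the design classification concrete, the sketch below maps three recorded attributes (control group, randomization, pretest data) to design names from the Campbell and Stanley typology. This is an illustrative reading rather than the coding procedure actually used in the review; the function name is hypothetical, and the nonequivalent control group branch is included only to complete the mapping.

```python
def classify_design(has_control: bool, randomized: bool, has_pretest: bool) -> str:
    """Map trial attributes to a Campbell and Stanley design label."""
    if has_control and randomized:
        # True experimental designs with random assignment.
        if has_pretest:
            return "pretest-posttest control group"
        return "posttest-only control group"
    if has_control:
        # Comparison group without random assignment.
        if has_pretest:
            return "nonequivalent control group"
        return "static-group comparison"
    # No comparison group at all.
    if has_pretest:
        return "one-group pretest-posttest"
    return "one-shot case study"
```

The number of data collection points is not used here; it matters for duration analyses, whereas the design label in this sketch depends only on the presence of a comparison group, randomization, and a baseline measurement.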
While the focus of this review was to provide an overview of the study designs and methodologies currently being employed for mHealth research, we were also interested in exploring the relationships between methodological variables, specifically identifying potential predictor variables for study duration. We first conducted independent t tests and one-way independent analyses of variance (ANOVA) to determine whether there were differences in study duration for the following categorical methodological variables: study sponsorship, clinical condition, pretest data collection, data collection frequency, presence of a control group, study purpose, presence of randomization, study group assignment, qualitative data collection, and app availability. We then performed a Pearson correlation analysis to test for a correlational relationship between sample size and study duration. These preliminary analyses were conducted to determine which variables were appropriate for inclusion in a multiple linear regression analysis. The assumptions of linearity, normality, independence of errors, and homoscedasticity were met, and diagnostic tests to check for outliers, homogeneity of variance, and multicollinearity were passed. The regression was then performed with study duration as the dependent variable and all significant predictor variables from our preliminary analyses as independent variables. Extreme outlier data were excised prior to analysis, leaving a dataset that included 64 trials (90%, 64/71), each with a sample size of 500 participants or fewer. Statistical significance was considered at P<.05 unless otherwise specified. All statistical analyses were conducted using SPSS Statistics version 22 (IBM Corporation).
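For readers unfamiliar with the statistics reported in this review, the following Python sketch shows how a one-way ANOVA F statistic and the η² effect size are computed from grouped data. It is not the SPSS procedure used by the authors, the function name is hypothetical, and it does not compute a P value, which requires the F distribution.

```python
from statistics import mean


def one_way_anova(groups):
    """Return (F, df_between, df_within, eta_squared) for a list of samples."""
    all_values = [x for group in groups for x in group]
    grand_mean = mean(all_values)
    # Variation of group means around the grand mean, weighted by group size.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of observations around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    # Eta squared: share of total variation attributable to group membership.
    eta_squared = ss_between / (ss_between + ss_within)
    return f_stat, df_between, df_within, eta_squared
```

In the reporting style used here, a result such as F(2,61)=3.7, η²=0.11 corresponds to `df_between=2`, `df_within=61`, and roughly 11% of the variation in study duration being associated with the grouping variable.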

General Characteristics
Of the 137 trials identified, 71 were found to meet inclusion criteria. Table 2 details each included trial and outlines their general characteristics. Key highlights include the ClinicalTrials.gov study identification, app name, target condition, sample size, and study duration.

Descriptive Characteristics
Data collection duration was relatively short on average (median 6 months, IQR 8) with the majority of trials having a data collection period of 6 months or less (72%, 51/71). However, the range of duration was broad, with the shortest data collection period lasting 10 days and the longest period lasting 4 years.
Study duration was 20 months on average (mean 21, SD 12); researchers continued to collect secondary data for nearly a year after they had completed their primary data collection (median 12, IQR 13). This discrepancy between study duration and data collection duration was more pronounced in studies with a total duration of 2 years or more (31%, 22/71) where the average time from recruitment to complete data collection (mean 35, SD 10) was 2 years longer than the average time required to collect primary data (mean 11, SD 8).
Nearly three-quarters of the trials (72%, 51/71) had official app names, which suggested that they were positioned for commercialization or were already available on the market. However, only 17 apps (24%, 17/71) were publicly available for download as of December 2015. Academic sponsorship was the most common form of trial funding (73%, 52/71), followed by an academic-industry collaboration (18%, 13/71) and industry sponsorship (9%, 6/71).

Methodological Analysis
Our preliminary t tests and ANOVAs to determine whether differences existed in study duration across methodological variables revealed three significant differences, by data collection frequency, masking, and study sponsorship (Table 5). A correlation analysis of the relationship between sample size and study duration revealed a positive but weak correlation between both variables: r=.25, P=.044. Based on this finding, we included sample size as a predictor variable in our multiple linear regression model for predicting study duration alongside data collection frequency (two versus four or more data collection points), masking (open versus single-blinded), and study sponsorship (academic versus industry). The focus of this analysis was prediction, so we used a stepwise method of variable entry. The results of our regression analysis indicated that all four of our predictors combined accounted for 32.6% of the variance in study duration: F(4,55)=6.6, P<.01, adjusted r²=.33. Data collection frequency alone, specifically the difference between two and four or more data collection points, was able to explain 11.5% of the variance in study duration. Together with the difference between single versus open masking, these variables explained 19.7% of the variance in study duration. Sample size added 6.7% to the explanation of variance in study duration, and the difference between academic and industry sponsorship added another 6.2%. Each step in the model added significantly to its predictive capabilities. Based on this model, the prediction equation is as follows: predicted study duration = 13.79 + 10.71 × (two versus four or more data collection points) + 6.88 × (single versus open masking) + 0.04 × (sample size) − 12.00 × (industry versus academic sponsorship). Table 6 presents the regression coefficients and standard errors for each of the four significant predictors.
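The prediction equation above can be evaluated directly. The Python sketch below assumes a dummy coding that the text does not spell out: 1 for four or more data collection points (0 for two), 1 for single-blinded (0 for open), and 1 for industry sponsorship (0 for academic). This coding is inferred from the direction of the reported effects, and the output is assumed to be in months, the unit used for study duration elsewhere in this review. The function name is hypothetical.

```python
def predicted_study_duration(
    four_or_more_points: bool,
    single_blinded: bool,
    sample_size: int,
    industry_sponsored: bool,
) -> float:
    """Evaluate the reported regression equation (assumed unit: months)."""
    return (
        13.79                                # intercept
        + 10.71 * int(four_or_more_points)   # assumed 1 = four or more points
        + 6.88 * int(single_blinded)         # assumed 1 = single-blinded
        + 0.04 * sample_size                 # per enrolled participant
        - 12.00 * int(industry_sponsored)    # assumed 1 = industry sponsor
    )
```

For example, an open-label, academically sponsored trial with two data collection points and 100 participants gives 13.79 + 0.04 × 100, or about 17.8 months. Because the model explains roughly a third of the variance in study duration, such values are rough expectations rather than precise forecasts.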

Principal Findings
Our review has shown that the overwhelming majority of mHealth researchers are continuing to use the RCT as the trial design of choice for evaluating mHealth apps. The consistent use of RCTs to demonstrate efficacy across disparate clinical conditions suggests that researchers view this design to be condition-agnostic and truly the gold standard for any clinical trial evaluating app efficacy. While trials of apps for managing obesity did not adhere to a two-group pretest-posttest control group comparison design as defined by the Campbell and Stanley framework, and only a third of mental health apps used this classic RCT design, the majority of trials for other prevalent conditions did favor this specific study design to evaluate health outcomes and elicit proof of app efficacy. This homogeneity of study designs within the framework suggests that researchers are not adapting designs to align with the unique qualities inherent in the mHealth apps they are evaluating.
Some unexpected findings emerged from our review, one being the near-complete lack of variation in study implementation sites-97% of trials were conducted onsite in academic centers and hospitals, with only two trials employing online recruitment and data collection. Regarding trial duration, mHealth trials had a total data collection period of 20 months on average. We were able to identify four predictor variables that accounted for 32.6% of the variance in trial duration: data collection frequency, masking, sample size, and study sponsorship.
Our analysis of the relationship between the number of data collection points in an mHealth trial and the duration of the trial revealed that trials with four or more data collection points would have a significantly longer data collection period compared to trials with two data collection points. While this finding suggests that mHealth trials might benefit from a study implementation process that includes automated data collection through the intervention app to allow for frequent data collection without prolonging study duration, our review results are inconclusive in supporting this recommendation given the lack of a clear relationship between study length and data collection frequency. In analyzing the raw review data, there is no significant difference in study duration between one, three, and four or more data collection points, and trials with one data collection point are similarly long in duration compared to trials with four or more data collection points. With this in mind, we are cautiously optimistic in our advocacy of automated study implementation, from recruitment to data collection, for all mHealth trials. While many trials had open masking, nearly a third chose to blind their participants or outcomes assessor, and four trials even went as far as to double-blind both participant and investigator. This level of rigor was unanticipated for a field that has been criticized for a lack of evidence demonstrating efficacy and impact [29]. We were surprised to find that single-blinded trials were significantly longer in duration compared to open trials. However, given the dearth of empirical evidence to support the role of double blinding in bias reduction [30] and the inconclusive nature of our raw data, which did not show an increase in study duration between open and double-blinded trials, more data are required to investigate this relationship prior to discounting the value of masking in favor of shorter trials.
Although the majority of reviewed trials were funded by academic research grants, industry-academic partnerships were not uncommon and suggest that industry publishers have realized the potential of engaging with academic institutions to bolster the credibility of their apps. However, these partnerships warrant particular attention given past lessons learned from duplicitous investigative behavior exhibited by industry-funded research teams [31]. Our review results revealed that industry-funded mHealth trials were significantly shorter in duration than their academic counterparts. A potential explanation for this difference in study duration is the use of study outcomes in industry trials that are more sensitive to short-term changes (eg, quality of life, frequency of desired health behaviors, engagement with mHealth app) over outcomes with a longer trajectory towards measurable change (eg, frequency of emergency department visits, quality-adjusted life years, mortality). These trials may also be bound by competitive industry-led timelines, which dictate how long an app can spend in research and development before it must be released to generate profit, a concern that is shared but not equally prioritized in academic mHealth app development. It is apparent that industry-funded mHealth trials differ from purely academic pursuits in both research objectives and anticipated outcomes, making efforts to maintain methodological rigor and increase the transparency of industry-academic collaborations a critical endeavor as these relationships grow in popularity.
It is very clear that only a fraction of publicly available apps are evaluated [32], and our identification of 71 mHealth trials initiated over a 1-year period is in stark comparison to the tens of thousands of unevaluated apps publicly deployed during the same time period. While the mHealth trials we reviewed were methodologically rigorous, it was obvious that the methods themselves have not changed: not once in the registration of any mHealth clinical trial was the CEEBIT methodology mentioned, nor alternate methodologies that have been identified as more suitable for mHealth evaluation. The mobile phone platform on which mHealth apps are hosted is not being leveraged through initiatives like ResearchKit to improve recruitment for large sample sizes or to passively collect data with built-in sensors. This is unfortunate given the opportunity to explore and build upon mobile phone capabilities for research purposes. It was also unclear how trials with data collection periods of 2 years or more would maintain the relevance of their findings.
From our preliminary results, it appears that investigators conducting mHealth evaluations are applying positivistic experimental designs to elicit causal health outcomes. This insight is a cause for concern because it neglects to consider that (1) mHealth apps are complex interventions [33] and (2) as such, mHealth apps might be fundamentally incompatible with evaluations founded on purely positivistic assumptions [34].
In addressing the first point, mHealth apps may simply be software programs on a mobile phone, but they have personal and social components that prove unstable when they are forced to be defined and controlled [35]. mHealth researchers should acknowledge that app users may intend to use technology for improved health but also exhibit unpredictable behaviors of poor compliance, deviant use, and in rare cases even negligence. This will affect both internal and external validity of traditional trials looking to prove direct causation.
To illustrate our second point, various positivistic assumptions regarding mHealth apps should be considered. A positivistic researcher might state that mHealth apps affect a single reality that is knowable, probabilistic, and capable of being objectively measured. They might think it is reasonable to make generalizable statements about the relationship between the app and consequent health outcomes. They might then assume a methodological hierarchy of research designs to validate this reality, with quantitative experimental studies being seen as the most robust, for which the RCT is the gold standard. While this viewpoint is evidently endorsed by the majority of mHealth researchers whose work was identified in this review, it has not been justified in practice due to the challenge of isolating the relationship between the user and the specific mHealth app being evaluated [14]. The hallmark of the RCT is its ability to control for contextual variables in order to only measure causal impact between independent and dependent variables. However, mHealth evaluations that implement an RCT methodology are often forced to engage in trade-offs that breach RCT protocol but increase the usage and adherence rates critical to study implementation [36]. mHealth researchers have recognized a host of research implementation barriers, from the deployment environment, to app bugs and glitches, to user characteristics and eHealth literacy [37]. It is arguably easier to prevent patients from taking a drug that might interfere with their health outcomes in a pharmaceutical trial than it is to prevent patients from using an alternative diabetes management app or reading about diabetes management strategies on a website during an mHealth trial. 
Finally, the apps evaluated in the trials we reviewed were not simple and static; they were sociotechnical systems [38] that were robust in functionality and provided timely, continuous, and adaptable care personalized to the needs of their users. If we ignore these natural attributes in evaluating apps and remain wedded to traditional research designs that view these strengths as confounders, we will fail to capture the complex technological nuances and mechanisms of change facilitated by apps [39] that can impact positive health outcomes.

Limitations
In addressing the limitations of our review, we must acknowledge the rapid pace at which mHealth trials are being registered to ClinicalTrials.gov. In the 5 months following our initial search, 31 new trials that met our inclusion criteria were added to the registry. On initial assessment, these trials are in line with our review findings. The majority adhere to a classic two-arm RCT design, target a range of complex chronic conditions, and are on average 2 years in duration. We aim to update our review in 6-month intervals to capture the high volume of incoming mHealth clinical trials.
Our study duration calculation was based on the "study start date" and "study completion date" fields reported by researchers on ClinicalTrials.gov. We recognize that in using study duration as the primary dependent variable for analysis, we are subjecting our results to the inherent variability of prospectively estimated study durations, which may differ greatly from actual study durations reported post trial. To address this limitation in the reliability of our data, we will monitor the status of all reviewed trials as they move toward completion and update our results to reflect any significant divergences between estimated and actual study duration.
Due to time and resource constraints, we did not perform an exhaustive search of all mHealth trials that had published either manuscripts or protocols in the literature during our 1-year search period. Our decision to have a sampling method solely focused on a single trials registry may have resulted in a biased identification of trials with more traditional positivist methods; the largely academic sponsorship of the trials we reviewed also suggests this. We acknowledge that the trials registered on ClinicalTrials.gov do not make up the sum total of mHealth research. There is a large body of mHealth evaluative work that is not registered on ClinicalTrials.gov, notably apps that have engaged in usability testing and feasibility pilot studies but have not undergone formalized clinical research [22][40][41][42][43][44], as well as direct-to-consumer apps that publish evaluative reports of their in-house testing online but do not submit their work for review through formal research channels [45][46][47]. As such, our findings on the homogeneity of mHealth clinical trial methods are limited to trials registered on ClinicalTrials.gov. We aim to conduct a more systematic search of the mHealth literature and also search additional mobile app store catalogues (ie, Windows, Samsung, Blackberry) for publicly available trial apps in a future review to improve the representativeness of our findings.

Conclusion
It is clear that mHealth evaluation methodology has not deviated from common methods, despite the issues raised. There is a need for clinical evaluation to keep pace with the rate and scope of change of mHealth interventions if it is to have relevant and timely impact in informing payers, providers, policy makers, and patients. To fully answer the question of an app's clinical impact, mHealth researchers should maintain a reflexive position [35] and establish feasible criteria for rigor that may not ultimately result in a positivist truth but will drive an interpretive understanding of contextualized truth. As the mHealth field matures, it presents the challenge of establishing robust and practical evaluation methodologies that further foundational theory and contribute to meaningful implementation and actionable knowledge translation, all for optimized patient health and well-being.