Applications of Federated Learning in Mobile Health: Scoping Review

Background: Advances in sensing and other communication technologies, as well as artificial intelligence (AI), have driven the widespread use of mobile health (mHealth) applications. For example, data collected from sensor devices carried by patients can be mined and analyzed using AI-based solutions to facilitate remote and (near) real-time decision-making in healthcare settings. However, patients may not wish to share their raw data due to privacy concerns. One possible solution is to utilize federated learning (FL) where only the trained parameters (and not the raw data) are shared during the model training process.
Objective: This scoping literature review explains what federated learning is, and how it can be used to deal with sensitive and heterogeneous data in mHealth applications so that it can help healthcare practitioners understand the associated limitations and challenges of FL in mHealth.
Methods: A scoping review was conducted following the guidelines of PRISMA-ScR (Preferred Reporting Items for Systematic Reviews and Meta-Analyses Extension for Scoping Reviews) [1]. We searched seven (7) commonly used databases. The included studies were analyzed and summarized to identify possible real-world applications and associated challenges of applying FL in healthcare settings. A total of 1,095 articles were retrieved during the database search, and 26 articles that met the inclusion criteria were eventually included in the review.
Results: Common applications of FL in mHealth include monitoring self-care ability, monitoring mental health status, disease progression, and auxiliary diagnosis of disease. Based on the analysis, we identified a number of research challenges and opportunities, such as those relating to communication costs, statistical heterogeneity, and privacy leakage.
Conclusions: While FL is a viable approach to addressing privacy concerns in mHealth applications, there are a number of technical limitations associated with the use of FL as outlined in this article. Hopefully, the challenges and opportunities identified in this


Introduction

Background
Mobile health (mHealth) generally refers to the application of mobile and other wearable devices (e.g., smartphones) in a healthcare setting (e.g., electronic health) [2]. In mHealth, healthcare-related data are sensed/collected using digital devices such as biomedical sensors attached to the user's body, or portable devices with relevant applications installed [2]. In order to better help doctors diagnose medical conditions and diseases and support remote, fine-grained, and high-quality precision healthcare, there have been attempts to utilize machine learning technology to analyze patient data [3][4][5]. Generally, datasets used to train a machine learning model are centralized. This implies that the central server has access to the data of all users. However, patients may not be willing to share their (raw) personal information, and the exacting regulatory requirements within the healthcare industry may also limit the sharing of sensitive information. Therefore, there exist medical data silos which will limit the use of conventional machine learning-based solutions.
To mitigate the privacy concerns associated with the sharing of raw patient data, one can use federated learning (FL), since only the trained model parameters rather than the raw data are shared during the learning process. In an FL-based approach, the server hosts a global model and, at each FL round, randomly chooses the users who will contribute to the training of the global model. After the user selection, the server sends the global model parameters to the selected users, who use their own data to train the model locally and then send their trained models (rather than the raw dataset) back to the server. The server then aggregates the models and updates the global model, and the same procedure repeats in the next FL round. However, most existing FL frameworks in the healthcare domain are not designed to support data from mobile and other wearable devices, which are an increasingly important data source, considering the pervasiveness of such devices [6].
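The round structure described above can be illustrated with a minimal sketch. It is not taken from any of the included studies: the linear model, the toy data, and the names `local_train`, `aggregate`, and `fl_round` are assumptions chosen only to keep the example small, and the aggregation shown is a sample-weighted average of the returned models.

```python
import random

def local_train(weights, data, lr=0.1, epochs=5):
    """Hypothetical local step: fit y = w*x + b on one user's data by gradient descent."""
    w, b = weights
    for _ in range(epochs):
        gw = sum(2 * (w * x + b - y) * x for x, y in data) / len(data)
        gb = sum(2 * (w * x + b - y) for x, y in data) / len(data)
        w, b = w - lr * gw, b - lr * gb
    return (w, b)

def aggregate(updates):
    """Server side: average the user models, weighted by local sample count."""
    total = sum(n for _, n in updates)
    dim = len(updates[0][0])
    return tuple(sum(m[i] * n for m, n in updates) / total for i in range(dim))

def fl_round(global_weights, users, fraction, rng):
    """One FL round: randomly select users, train locally, aggregate on the server."""
    k = max(1, int(fraction * len(users)))
    selected = rng.sample(sorted(users), k)
    # Only the trained parameters and a sample count leave each user, never the raw data.
    updates = [(local_train(global_weights, users[u]), len(users[u])) for u in selected]
    return aggregate(updates)

rng = random.Random(0)
# Toy data (assumption): every user observes the same relation y = 2x + 1 on a different x range.
users = {f"user{i}": [(x / 10, 2 * (x / 10) + 1) for x in range(i, i + 10)] for i in range(5)}
weights = (0.0, 0.0)
for _ in range(200):
    weights = fl_round(weights, users, fraction=0.6, rng=rng)
print(weights)
```

In this toy setting all users share the same underlying relation, so the global model approaches it; real mHealth data are rarely this well behaved, as the later discussion of statistical heterogeneity makes clear.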

Objective
Although FL is a viable tool to support privacy-preserving healthcare decision-making, there are a number of existing limitations and challenges associated with the deployment of such models and tools in clinical practice. Therefore, this research explores and outlines the current progress of FL-based mHealth applications (e.g., how FL can be used to deal with sensitive, unbalanced, and heterogeneous data across users' mobile devices) and discusses the associated limitations and challenges so that healthcare professionals and policymakers can make informed decisions when considering FL-based solutions.
In summary, this review will:
1. Provide an overview of FL applications in mHealth settings.
2. Help practitioners understand the challenges of applying FL in mHealth settings.
3. Explore and identify potential approaches in developing an efficient, effective, and robust FL system for mHealth applications.

Overview
This scoping review was prepared and reported according to the guidelines of the PRISMA-ScR (Preferred Reporting Items for Systematic Reviews and Meta-Analyses Extension for Scoping Reviews) framework [1].

Data Sources and Search Strategy
Our literature search was conducted using seven databases (PubMed, JMIR, Web of Science, IEEE Xplore, ACM Digital Library, ScienceDirect, and Springer) for articles published between January 2016 and January 2022. The search was carried out using keyword combinations such as "federated learning" AND "mobile-health*". The abstract was used to extract the necessary information and to avoid selection bias. A second search was also carried out using keyword combinations such as "collaborative learning" AND mobile AND health* to identify articles published between January 2015 and January 2018 in PubMed, JMIR, Web of Science, IEEE Xplore, and ACM Digital Library. This second search was conducted because the term FL was coined by Google in 2016, and collaborative learning was its predecessor and has been used interchangeably with FL.

Inclusion and Exclusion Criteria
This study focused on the applications of FL in mHealth, rather than the broader electronic health (eHealth) settings. In other words, the data used in FL training should be captured or recorded by mobile devices, and the study should address how such data and their analyses can be used to solve real-world healthcare problems. Hence, articles that presented a new FL technique that was not designed for mHealth applications were excluded. We also considered only research articles published in journals and refereed conference proceedings, and excluded other publications such as conference abstracts, books, editorials, and commentaries.

Study Selection
A total of 1,095 publications were retrieved during the database search. Of these, 529 studies were removed using the content type filter in each database, and 32 duplicate studies were removed. Of the remaining 534 studies, 487 were excluded after title and abstract screening, leaving 47 relevant articles. Upon completion of the screening process, 26 of the 47 articles were found to meet the inclusion criteria and were included in the review. The study selection process is depicted in Figure 1.
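As a quick consistency check, the counts reported above can be recomputed step by step. The variable names are illustrative, and the number of articles excluded at the final stage is derived here from the reported totals rather than stated in the text.

```python
# Counts as reported in the study selection paragraph.
retrieved = 1095
removed_by_content_type = 529
duplicates_removed = 32
excluded_on_title_abstract = 487
included = 26

after_filters = retrieved - removed_by_content_type - duplicates_removed
after_screening = after_filters - excluded_on_title_abstract
# Derived value (not stated in the text): articles excluded at the final eligibility stage.
excluded_at_eligibility = after_screening - included

print(after_filters, after_screening, excluded_at_eligibility)
```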

Search Results
As explained earlier, only 26 of the 1,095 located articles were selected for further analyses. These 26 publications were categorized into eight (8) themes (e.g., mental disorder detection and cardiac health monitoring) based on their potential applications in mHealth.

Applications and Type of Data Used
One application of FL in mHealth settings is the monitoring of self-care ability, health status, and disease progression, as well as the auxiliary diagnosis of disease. An example of the latter is Parkinson's disease, whose associated motor symptoms can be captured and monitored using wearable and mobile devices (i.e., human activity recognition using mobile sensors). Thirteen of the 26 included studies focused on human activity recognition using mobile sensors, for example, using FL to distinguish activities of daily living (ADLs) such as walking, sitting, and standing. ADL recognition is especially useful for monitoring the self-care ability and health status of senior citizens [5]. Two of the studies discussed the potential of detecting Parkinson's disease from mobile sensor signals recorded while people perform specific activities, such as arm droop, balance, or gait [7,8]. It was also noted that patients with Parkinson's disease may exhibit freezing of gait, which can be detected during the patients' daily activities [9]. Another disease that can be monitored in daily life is epilepsy, as noted by one study [5]. Specifically, the authors reportedly used a dataset comprising electrocorticography (ECoG) recordings of epilepsy patients, which can be used to facilitate the monitoring of such patients in their daily activities [10].
A second common application is the detection of mental disorders such as depression and anxiety. At present, the diagnosis of mental health disorders depends almost entirely on the subjective judgment of the doctor, based on communication with the patient and the patient's responses to health questionnaires [6]. Four of the included studies focused on this topic and identified a number of potential research opportunities for mental health disorder detection in mHealth settings. Xu et al. [6] proposed a federated depression detection method. Their evaluations reportedly took place in a hospital, where each participant was issued a smartphone that collected keyboard usage data for each session. Data from the participants' weekly Hamilton Depression Rating Scale (HDRS) [11] and Young Mania Rating Scale (YMRS) [12] tests were also collected. The other three related studies demonstrated the use of FL in detecting stress using data collected from mobile devices during certain activities; the only difference from ADL recognition was the type of data collected, such as ECG or EDA signals [13][14][15].
Monitoring of cardiac health (e.g., ECG monitoring and heart rate prediction) is another common application. Ogbuabor et al. [16], for example, proposed a decision support system for cardiac health monitoring that makes use of FL. Fang et al. [17] proposed a Bayesian inference FL method for heart rate prediction. Raza et al. [18] designed an FL framework for ECG-based healthcare that can reportedly classify different arrhythmias effectively. The authors [18] also proposed an explainable artificial intelligence (XAI)-based module on top of the classifier to ensure the interpretability of the classification results, so that clinical practitioners can better interpret the predictions.
There have also been attempts to use FL to detect skin disease [19,20], as well as to address other conditions and tasks, such as diabetes management, obesity management, and prognostic prediction of perioperative complications. Gong et al. [21], for example, experimentally evaluated their proposed collaborative learning scheme on a diabetes dataset and demonstrated its practicality for mHealth monitoring scenarios. Siddiqui et al. [22] integrated FL with the Internet of Medical Things (IoMT) architecture to detect the risk of obesity in individuals; BMI data were analyzed to assess the obesity risk and generate expert recommendations. Guo et al. [23] proposed an FL system and experimented on a breast cancer dataset by manually simulating mobile scenarios, demonstrating the potential of future mHealth applications for cancer detection. Sun et al. [24] proposed an FL framework for prognostic prediction of perioperative complications and conducted experiments on a real-world mHealth dataset, which suggested the utility of the proposed method. A summary of the datasets used in different studies and their applications can be found in Table 1. Most of the studies (14 out of 26) use human activity recognition datasets, which can facilitate the recognition of human activities. For ADL recognition, smartphones and smartwatches are the most commonly used devices in data collection. An inertial measurement unit (IMU), which comprises accelerometers, gyroscopes, and magnetometers, is a common sensor used in activity recognition. As observed in Table 2, the most frequent data types for daily activity recognition are acceleration, angular velocity (gyroscope), and magnetometer readings. For Parkinson's disease, a specific type of data is used in [8], namely acceleration during freezing of gait. It differs from normal acceleration data because generally only people with Parkinson's disease exhibit such characteristics.
Another Parkinson's disease dataset recorded acceleration and angular velocity while people performed or exhibited arm droop, balance, gait, postural tremor, and resting tremor [7]. These five symptoms were ranked into five levels from normal to severe in order to distinguish healthy individuals from patients with Parkinson's disease. Another disease that can be detected while people perform activities is epilepsy, and electrocorticography (ECoG) has been used to classify epileptic patients [10].
For mental health disorder detection, many different data types can be and have been used. Examples include electrocardiogram (ECG), electrodermal activity (EDA), blood volume pulse (BVP), and body temperature, and such data are often used together with activity sensor data. Skin disease detection, by contrast, relies on the analysis of dermatological disease-related images [19,20]. The dataset used in [23] is a breast cancer dataset whose data are digitized images of fine needle aspirates of breast masses. Experimental results suggested that applying FL to image data can potentially achieve high performance.

Challenges in FL-based applications in mHealth settings
There are several key challenges in the application of FL in mHealth settings, including statistical heterogeneity, systems heterogeneity, expensive communication overheads, privacy leakage, and real-time data streams (see also Table 2).

Statistical heterogeneity
A large-scale FL system includes a large number of user devices such as smartphones, tablets, and smartwatches. These devices frequently generate and collect data in a non-independent and identically distributed (non-IID) manner, which can subsequently affect the accuracy of the FL-based approach. Moreover, users' patterns are diverse, imbalanced, and heterogeneous, especially when it comes to data recorded during user activities.
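To make the non-IID setting concrete, the following sketch simulates label skew, one common form of statistical heterogeneity in which each user only observes a few activity classes. The activity labels, the function name `partition_by_label_skew`, and the sampling scheme are assumptions made for illustration and are not drawn from any included study.

```python
import random
from collections import Counter

def partition_by_label_skew(labels, num_users, labels_per_user, rng):
    """Assign sample indices to users so that each user only sees a few classes."""
    classes = sorted(set(labels))
    by_class = {c: [i for i, y in enumerate(labels) if y == c] for c in classes}
    users = {}
    for u in range(num_users):
        own_classes = rng.sample(classes, labels_per_user)
        # Each user draws half of the samples of each of its own classes.
        users[u] = [i for c in own_classes
                    for i in rng.sample(by_class[c], len(by_class[c]) // 2)]
    return users

rng = random.Random(42)
# Hypothetical pooled dataset: 50 samples for each of four activities.
labels = ["walking", "sitting", "standing", "lying"] * 50
users = partition_by_label_skew(labels, num_users=6, labels_per_user=2, rng=rng)

for u, idx in users.items():
    print(u, dict(Counter(labels[i] for i in idx)))
```

Each user's label histogram differs sharply from the pooled distribution, which is the situation in which a single averaged global model tends to fit individual users poorly.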

Expensive communication costs
In typical FL systems, a large number of user devices would necessitate many global communication rounds between users and the server, and this would impact performance [54]. This reinforces the importance of developing communication-efficient methods in FL frameworks, which should send minimal information (e.g., only the required information or model updates) as part of the training process.
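A rough back-of-the-envelope estimate shows why both the number of rounds and the size of each update matter. The sketch below assumes that every selected user downloads and uploads the full model once per round and that each parameter is stored as a 4-byte value; the model size, user count, and round count are hypothetical.

```python
def communication_cost_bytes(num_params, users_per_round, rounds, bytes_per_param=4):
    """Total traffic if each selected user downloads and uploads the full model every round."""
    per_user_per_round = 2 * num_params * bytes_per_param  # one download plus one upload
    return per_user_per_round * users_per_round * rounds

# Hypothetical setting: a 1M-parameter model, 10 users per round, 100 rounds.
baseline = communication_cost_bytes(1_000_000, users_per_round=10, rounds=100)
fewer_rounds = communication_cost_bytes(1_000_000, users_per_round=10, rounds=50)
print(baseline / 1e9, "GB vs", fewer_rounds / 1e9, "GB")
```

Under these assumptions the cost scales linearly in each factor, so halving the number of rounds or halving the transmitted parameters per round yields the same saving.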

System heterogeneity
In FL networks, the storage, computational, and communication capabilities of each device are likely to differ due to variability in hardware, network connectivity, etc. [54]. Such system-related constraints may result in system incompatibility (e.g., due to data formats), and there is a risk that only a small fraction of devices will be active in a federated network at any one time and that active devices will drop out at a given iteration [55]. Consequently, system heterogeneity may affect the performance of the FL-based approach in practice.

Privacy leakage
While FL is designed to mitigate privacy concerns associated with conventional machine learning-based applications (since only model updates such as gradient information, instead of the raw data, are shared in FL-based approaches), communicating model updates throughout the training process may also reveal sensitive information, either to a third party or to the central server [56]. There have been attempts to design tools to enhance the privacy of FL, for example using secure multiparty computation (SMC) [57] and differential privacy (DP) [56,58]. However, SMC and DP are often costly and affect model performance or system efficiency [55]. Thus, balancing such tradeoffs is a considerable challenge in realizing practical FL-based systems.
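The tradeoff between privacy and model performance can be seen in a minimal sketch of one widely used DP-style recipe: clip each user's update to a maximum norm and add Gaussian noise before it leaves the device. The function names, the clipping bound, and the noise multiplier below are assumptions, and the sketch omits the privacy accounting needed to state an actual privacy guarantee.

```python
import math
import random

def clip_update(update, max_norm):
    """Scale the update down so that its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(v * v for v in update))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [v * scale for v in update]

def privatize(update, max_norm, noise_multiplier, rng):
    """Clip, then add Gaussian noise whose scale is proportional to the clipping bound."""
    clipped = clip_update(update, max_norm)
    sigma = noise_multiplier * max_norm
    return [v + rng.gauss(0.0, sigma) for v in clipped]

rng = random.Random(7)
update = [3.0, 4.0]  # hypothetical model update with L2 norm 5
clipped = clip_update(update, max_norm=1.0)
noisy = privatize(update, max_norm=1.0, noise_multiplier=1.0, rng=rng)
print(clipped, noisy)
```

A larger noise multiplier hides more about any individual update but also distorts the aggregated model more, which is the performance cost noted above.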

Real-time Data Stream
Real-time data (e.g., sensing measurements) are the norm in mHealth settings. Hence, when applying FL in mHealth settings, the FL model should respond promptly as newly generated sensor data arrive. In other words, the FL process needs to be capable of working continuously with incoming data streams in real time.

Table 2. Challenges of applying FL in mHealth settings and the included studies that address them.
Expensive Communication: [4,6,7,8,10,14,16,18,19,20,23,26,27,28,29,36,40,42]
Systems Heterogeneity: The storage, computational, and communication capabilities of each device may differ due to variability in hardware, network connectivity, etc. [13,17,29,42]
Privacy Leakage: Communicating model updates throughout the training process can nonetheless reveal sensitive information, so the privacy of FL needs to be enhanced. [7,10,13,18,21,23,24,27,29,30]
Real-time Data Stream: Sensing measurements are generated consecutively, and the FL process should work continuously with newly emerging data. [6,18,22,28,29,40]

In the following paragraphs, we provide more details on how each of the challenges can be addressed, according to our findings from the included papers.
First, 13 of the included studies addressed the challenge of statistical heterogeneity. For example, Ek et al. [36] empirically evaluated three FL algorithms (i.e., FedAvg [59], FedPer [60], and FedMA [61]) to determine their effectiveness. Their findings suggested that averaging more personalized models degrades performance when the learned global model is evaluated on the test set; hence, many challenges remain for personalized FL models. A number of studies attempted to address this challenge, and training personalized models is a common method. To this end, several studies proposed to model heterogeneous user data and design personalized FL methods. One way to realize this is similarity-aware learning [28,42], through which the FL system captures the underlying relationships between users and clusters them so that users in the same group can collaboratively learn personalized models. Another possible approach is to learn personalized high-level features in deep neural networks, because the lower layers focus on learning common and transferable features. For example, Chen et al. [7] performed transfer learning on each user to learn a personalized model, and Raza et al. [18] used a CNN model as the classifier, in which the densely connected layers focused on learning specific high-level features. It is also possible to boost the representation ability of the embedding network. An example is the federated representation learning framework proposed by Li et al. [4], in which a signal embedding network was meta-trained in a federated manner, and the learned signal representations were then fed into a personalized classification network for better activity prediction for each user. Moreover, Liu et al. [15] added a simple user embedding to their neural network, which is a simple example of collaborative personalization to deal with statistical heterogeneity. In addition, Xiao et al. [27] designed a feature extractor responsible for exploring sufficient local features and global relationships from heterogeneous data in order to address statistical heterogeneity.
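One way to picture the layer-based personalization idea is to average only the lower, shared layers across users while each user keeps its own high-level layers. The sketch below is a simplified illustration in that spirit and is not the exact procedure of any cited method; the layer names `base` and `head` and the helper functions are hypothetical.

```python
def average_layers(models, shared_layers):
    """Average only the named shared layers across all user models."""
    return {
        name: [sum(vals) / len(models) for vals in zip(*(m[name] for m in models))]
        for name in shared_layers
    }

def apply_shared(model, shared):
    """Overwrite the shared layers of a user model; personal layers stay untouched."""
    updated = dict(model)
    updated.update(shared)
    return updated

# Hypothetical user models: a shared feature extractor ("base") and a personal classifier ("head").
user_models = [
    {"base": [1.0, 2.0], "head": [10.0]},
    {"base": [3.0, 4.0], "head": [20.0]},
]
shared = average_layers(user_models, shared_layers=["base"])
personalized = [apply_shared(m, shared) for m in user_models]
print(personalized)
```

After the update, all users hold the same base parameters but different heads, so common features are learned collaboratively while the final prediction layer adapts to each user's own data.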
Apart from modeling heterogeneous data, some studies tried to address statistical heterogeneity by guaranteeing convergence for non-IID data. Indeed, when data are not identically distributed across devices, traditional FL methods can diverge in practice when the selected devices perform too many local updates [55]. Accordingly, Xu et al. [6] proposed a multi-view FL framework using multisource data collected from smartphone keyboards, and this framework was shown to ensure convergence for non-IID data from users' smartphone keyboards. Yu et al. [29] derived a personalized strategy for users that require personalized service; those users convey gradients to the server only once rather than participating in all the aggregation rounds in each training iteration, so that convergence can be guaranteed. Moreover, Wu et al. [40] devised a generative convolutional autoencoder (GCAE) network that can generate and synthesize samples of minority classes in order to deal with imbalanced and non-IID data distributions. Additionally, Gudur et al. [32] proposed an FL framework with two versions (one using a model distillation update and the other using a weighted update) for addressing heterogeneity issues in non-IID scenarios, which leveraged overlapping information gain across activities. Zhang et al. [13] developed a new local update scheme and an adaptive global update scheme; together, these two components allow each device to decide the optimized local and global update strategy to deal with the non-IID problem.
Meanwhile, 19 of the studies addressed the challenge of expensive communication, among which 10 studies [4,[6][7][8]14,16,19,20,36,40] adopted FedAvg [59] to reduce communication overheads, since the FedAvg scheme is a common method used to overcome challenges such as communication inefficiency and limited computational power on devices. In addition, Xiao et al. [27] proposed a system that does not calculate and circulate the average weights until the model weights from a specific number of connected users are received, aiming to develop a communication-efficient system. Yu et al. [29] devised an unsupervised gradient aggregation strategy together with the use of FedAvg, which was demonstrated to reduce the communication overheads. The scheme proposed by Gong et al. [10] was based on the alternating direction method of multipliers (ADMM) [62], which provides a possible way to decompose the logistic regression model into smaller subproblems that can be computed locally. This can potentially decrease the communication costs. Furthermore, Fang et al. [17] proposed a Bayesian model along with a distributed inference algorithm that allows parallel processing in order to improve communication efficiency.
The above studies focus on developing optimization methods that allow for flexible local updating, hence reducing the total number of communication rounds. Alternatively, model compression schemes can also ensure communication efficiency by significantly reducing the amount of information communicated in each round, and there are several ways to achieve this. For instance, Liu et al. [26] proposed a collaborative privacy-preserving learning system that mainly considered two different parameter exchange protocols (round-robin and asynchronous), both of which can ensure low communication costs. Ouyang et al. [42] designed a learned cluster structure for the system, based on which the system utilizes cluster-wise straggler dropout and correlation-based node selection to reduce the communication overheads. Moreover, Tu et al. [28] proposed FedDL, which can reduce the number of parameters communicated between users and the server through a dynamic layer-wise sharing scheme, as only the lower layers of local models need to be uploaded to the server for global training. Raza et al. [18] designed a framework for ECG-based healthcare and proposed a new method called layer selection, which can significantly reduce the overall communication costs. Finally, Guo et al. [23] offloaded health monitoring and model training tasks to hospitals' private servers, which are equipped with relatively stronger computing resources and protect the privacy of their users, in order to reduce communication costs.
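The effect of sharing only some layers on the per-round payload can be estimated with a small sketch. The layer names and sizes are hypothetical, the 4-byte parameter size is an assumption, and the code only measures payload size; it does not reproduce the layer selection logic of any cited framework.

```python
def upload_payload(model, layers_to_share):
    """Keep only the layers that a user actually sends to the server."""
    return {name: model[name] for name in layers_to_share}

def payload_size_bytes(payload, bytes_per_param=4):
    """Payload size, assuming every parameter is stored as a 4-byte value."""
    return sum(len(values) for values in payload.values()) * bytes_per_param

# Hypothetical local model with two lower layers and one large dense layer.
model = {"conv1": [0.0] * 100, "conv2": [0.0] * 200, "dense": [0.0] * 700}

full = payload_size_bytes(upload_payload(model, list(model)))
lower_only = payload_size_bytes(upload_payload(model, ["conv1", "conv2"]))
print(full, lower_only, lower_only / full)
```

In this toy configuration, uploading only the two lower layers sends 30% of the full model per round, which illustrates how layer-wise sharing reduces the amount of information exchanged without changing the number of rounds.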
Next, among all the included studies, 4 studies tried to solve the system heterogeneity problem. For instance, Ouyang et al. [42] leveraged the inherent cluster relationship to dynamically drop nodes during the federated learning process; that is, the server drops stragglers that converge more slowly than other nodes within each cluster, as well as nodes that are less related to others in the same cluster. Tu et al. [28] utilized the similarity among users' model weights to learn the layer-wise sharing structure, which can be regarded as an asynchronous scheme that mitigates stragglers in heterogeneous environments. Furthermore, Yu et al. [29] proposed a hierarchical attention architecture for each client to fuse various measurements collected from different sensors across devices; through feature fusion, the attention layers can also eliminate differences in the length of input sequences across different sensing windows. In [13], the bottom-up design of the new local update scheme and the adaptive global update scheme allowed the proposed FL system to meet device-specific optimization goals (e.g., energy savings) while strictly protecting user privacy.
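The idea of dropping stragglers within each cluster can be sketched as follows. This is only an illustration under stated assumptions: slowness is summarized by a single hypothetical per-node score, and a node is dropped when its score exceeds a tolerance times the median of its cluster, which is not necessarily the criterion used in the cited work.

```python
from statistics import median

def drop_stragglers(slowness, clusters, tolerance=1.5):
    """Within each cluster, keep nodes whose slowness is at most tolerance x the cluster median."""
    kept = {}
    for cluster_id, nodes in clusters.items():
        cutoff = tolerance * median(slowness[n] for n in nodes)
        kept[cluster_id] = [n for n in nodes if slowness[n] <= cutoff]
    return kept

# Hypothetical slowness scores (e.g., time to finish a local round) and cluster assignment.
slowness = {"n1": 1.0, "n2": 1.2, "n3": 5.0, "n4": 2.0, "n5": 2.2}
clusters = {"a": ["n1", "n2", "n3"], "b": ["n4", "n5"]}
kept = drop_stragglers(slowness, clusters)
print(kept)
```

Comparing nodes against their own cluster rather than against all devices keeps a uniformly slower group of devices from being excluded entirely.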
Ten studies addressed the privacy leakage problem when applying FL. There are three main ways to protect data privacy in the FL framework [55]: secure multiparty computation (SMC) [57], differential privacy (DP) [56,58], and homomorphic encryption (HE) [63]. Among the 10 studies, 4 studies [7,10,21,27] utilized HE to avoid information leakage, and 2 studies adopted DP. For example, Yu et al. [29] leveraged DP [56,58] and a Byzantine-robust aggregation rule [64] to defend against malicious clients and prevent data recovery attacks, and Guo et al. [23] proposed a two-stage strong privacy protection scheme based on DP to resist external and internal security risks during interaction. Other novel methods were also proposed. Lyu et al. [30] proposed a two-stage privacy-preserving scheme called RG-RP, which reportedly delivers strong recovery resistance to MAP estimation attacks. Zhang et al. [13] designed an abnormal health detection (AHD) system that strictly prohibited any privacy violations in order to meet the strict privacy requirements of AHD. Raza et al. [18] realized privacy enhancement by sharing only weights that contain more common and low-level (i.e., less private) features. In the framework of Sun et al. [24], users upload a masked weighted sum, so that the training process does not reveal individual information.
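The intuition behind uploading masked values can be shown with a generic pairwise-masking sketch, in which every pair of users shares a random mask that one adds and the other subtracts, so the masks cancel in the server's sum. This is a textbook-style illustration and not the scheme of any cited study; it assumes integer-quantized updates and omits key agreement and the handling of users who drop out.

```python
import random

MODULUS = 2 ** 32

def mask_updates(updates, rng, modulus=MODULUS):
    """For each pair (i, j), user i adds a shared random mask and user j subtracts it."""
    n, dim = len(updates), len(updates[0])
    masked = [list(u) for u in updates]
    for i in range(n):
        for j in range(i + 1, n):
            mask = [rng.randrange(modulus) for _ in range(dim)]
            for d in range(dim):
                masked[i][d] = (masked[i][d] + mask[d]) % modulus
                masked[j][d] = (masked[j][d] - mask[d]) % modulus
    return masked

def server_sum(masked, modulus=MODULUS):
    """The server only ever sees masked vectors; the pairwise masks cancel in the sum."""
    dim = len(masked[0])
    return [sum(m[d] for m in masked) % modulus for d in range(dim)]

rng = random.Random(3)
updates = [[3, 10], [5, 20], [7, 30]]  # hypothetical integer-quantized user updates
masked = mask_updates(updates, rng)
total = server_sum(masked)
print(masked, total)
```

The server recovers the exact sum of the updates while each individual upload looks like random values, which is the property that masked aggregation relies on.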
Finally, 6 of the studies considered real-time data in mHealth. Two of the studies [18,22] were designed to work continuously with newly emerging data in general, while several studies developed specific methods or utilized unique data sources. To illustrate, Xu et al. [6] predicted the occurrence of depression by recording real-time data from patients using smartphone keyboards. The system proposed by Tu et al. [28] can periodically update the layer-wise sharing structure and models to deal with users' dynamic data distributions. Another example is Yu et al. [29], in which unlabeled users were trained with an online learning method, so that the proposed system can use only a small number of labeled users with limited samples, along with the massive real-time sensing data streams produced by unlabeled users, to train a model with competitive performance. In [40], the framework was able to perform incremental learning [65], which means that both the cloud model and the user models can be updated continuously when new user data arrive. In addition, the learned cloud model, which captures the generic information of users, can be easily deployed as a prior model for a newly joining user.
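Working continuously with a data stream can be illustrated by a model that is updated one sample at a time rather than retrained on a stored dataset. The sketch below uses a hypothetical linear model and a synthetic stream; it shows only the local, per-sample update and not the federated aggregation around it.

```python
def online_update(weights, sample, lr=0.05):
    """Update a linear model y = w*x + b from a single newly arrived (x, y) sample."""
    w, b = weights
    x, y = sample
    error = w * x + b - y
    return (w - lr * 2 * error * x, b - lr * 2 * error)

def sensor_stream(num_samples):
    """Synthetic stream (assumption): readings that follow y = 3x - 1."""
    for t in range(num_samples):
        x = (t % 10) / 10
        yield (x, 3 * x - 1)

weights = (0.0, 0.0)
for sample in sensor_stream(3000):
    weights = online_update(weights, sample)  # no sample is stored after it is used
print(weights)
```

Because each sample is consumed once and then discarded, memory use on the device stays constant regardless of how long the stream runs.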

Principal Findings
To advance the understanding of how to apply FL in mHealth, a scoping literature review of 26 published studies was conducted. The findings of this review indicate that FL, a relatively new method that can protect users' privacy, shows promising results but is still developing. The results also identify possible application domains of FL in mHealth and point out the challenges of applying FL, as well as possible approaches to address those challenges.
One of the goals of this review was to provide a summary of the applications of FL systems in mHealth. The results indicate that FL can be applied to address many real-life problems in mHealth, such as daily health monitoring, mental health disorder detection, and disease diagnosis. When applying FL in the mHealth domain, various types of sensor data, such as acceleration, angular velocity, magnetometer, ECG, and temperature data, can be collected to address different problems.
In addition, this review aimed to help practitioners understand the challenges of applying FL in mHealth, such as expensive communication overheads, statistical heterogeneity, and systems heterogeneity. It provides an overview of what these challenges are and points out their potential impacts on real-world FL systems.
This review's final aim was to identify approaches that can potentially be applied to address these challenges. For example, FedAvg [59], a typical FL approach, averages all models from users to learn a single global model. However, such an approach has been shown to suffer from significant accuracy degradation under heterogeneous and imbalanced data distributions of users [53]. Personalized FL was therefore proposed as a countermeasure to deal with statistical heterogeneity. For communication efficiency, there already exist many common and well-performing methods, such as FedAvg and model compression schemes. For data security, studies often adopt additional privacy protection mechanisms, such as DP and SMC, but these mechanisms reduce model performance or system efficiency. Thus, it is necessary to balance the tradeoffs between privacy leakage and system performance or efficiency, which requires further research. Moreover, systems heterogeneity is a significant challenge when applying FL, but currently only very few studies address this issue.

Limitations
This review has several limitations. First, this review considered only five of the challenges of applying FL; other challenges, such as data scarcity, system reliability, and scalability, were not explored due to page length limitations. Second, although the search covered several years, studies from 2020 to 2021 dominated among the final included studies. The main reason may be that FL was first proposed by Google in 2016, and research in this field has grown steadily since then. Furthermore, the search terms used in the paper collection process have limitations and could not capture every related study. However, based on the data sources and the search strategy we used, the studies that we collected initially should include most of the relevant studies.

Conclusions
This literature review provides information and practical implications for designers and healthcare providers: federated learning is a novel privacy-preserving learning method that can be adopted in a number of fields, especially mHealth. There are many applications of federated learning in mHealth, such as Parkinson's disease detection and monitoring, cardiac health monitoring, and mental health disorder detection. On the other hand, there are also many barriers that prevent practitioners from putting it into practice. When applying federated learning, particular attention should be paid to the most common obstacles, such as communication efficiency, statistical and system heterogeneity, and privacy leakage. We would like to emphasize the possibility of considering federated learning as an efficient learning method to protect users' privacy. In the meantime, more research should be conducted to address those common challenges and improve the performance of federated learning systems, in order to provide better services to patients, their families, and general users.