Introduction

Rising healthcare costs, driven by the ageing population and the related increase in non-communicable diseases, call for ways to reduce expenses by diminishing the need for care and making current care more efficient [1]. At present, healthcare provision is reactive and process driven: patients are treated according to predefined pathways that hardly consider individual necessities and capabilities [2]. As a result, health authorities and healthcare providers are beginning to notice a resource that has remained unused until now: the patient or person. By starting with the primary need of the person (to be healthy) and including him/her in the process in an active role, new paradigms for healthcare become possible. Significant cost reductions can be achieved by developing preventive solutions that help the person adopt a healthier lifestyle, thus reducing the number of patients, and by providing the person with tools to participate actively in the treatment when diseases do arise, thus decreasing the burden on care personnel [3]. In this context, we have recently observed great advances in technology to empower people's self-care [4]. Smartphones, tablets and quantified-self-style self-monitoring wellness devices are more commonplace than ever before. However, wellness-oriented solutions often suffer from short-term use due to quickly diminishing interest from their users [5] and from a lack of possibilities to use them in conjunction with clinical healthcare treatments [6]. Patients are left alone with their problems in between therapy or treatment sessions, and any personal data collected is left unused. Insights from psychology and gamification allow for approaches that keep people engaged and achieve the desired health impact [7].

The present work is part of the Personal Health Empowerment (PHE) project (https://itea3.org/project/personal-health-empowerment.html). The main goal of PHE is to empower people with Chronic Obstructive Respiratory Diseases (CORD) and to help patients monitor and improve their health using personal data and technology-assisted coaching (also known as digital coaching). Innovations in the PHE project include the definition and development of reliable ways to self-measure pain, respiratory function and behaviour, and the development of analytics on heterogeneous personal health sources. Together, these measurements will provide insight into the relation between behaviour and health and support the specification of coaching methodologies. As a result, interactive, dynamic and personalized coaching programmes will be developed, together with innovative and motivating approaches for long-term adherence, bridging the gap between wellness and care. In this work we present an overview of existing approaches to user modelling in the literature, including the characteristics and techniques available to model different user profiles. Then, as the main goal of this work, the most relevant characteristics and techniques are identified and integrated into a user modelling approach for patients with CORD, which also includes information such as general health data, habits, behaviours, symptoms, diagnosis, and the patient's historical treatments and therapies. The proposed user model will be essential to establish the viability and constraints of developing a coaching platform that provides efficient recommendations to patients with CORD who have different necessities and capabilities.

In the next section, the literature on user modelling is reviewed. Section 3 describes the clustering analysis performed for the CORD Management use case. Finally, Section 4 presents the major conclusions and the work to be done hereafter.

User Modelling

This section presents the definitions associated with user modelling, including the main characteristics and techniques found in the literature.

A user model is composed of a set of characteristics that are used to adjust content, presentation and navigation to each user. These characteristics can be domain-dependent or domain-independent and relate to beliefs, preferences, knowledge and attributes of the user. Domain-dependent (DD) data relates to system responses tailored to the user's knowledge of the domain [8]. For this, it is necessary to perceive the user's current state and knowledge of the concepts and relations inherent to the domain, predict how the user will interpret system responses, understand the different goals and plans of each user, predict and respond to different mistakes made while the user is using the system, and identify the most adequate way to present information to each user. Two methods can be used to measure user knowledge and expertise regarding the domain: Direct Dialogue and Indirect Acquisition. Direct Dialogue is performed directly with the user in order to assess his/her expertise in the domain. Direct Dialogue features allow users to input and share their knowledge (for example, using questionnaires or forms) and include mechanisms to process the inserted data and measure the user's knowledge of the domain. Indirect Acquisition allows the system to assess user knowledge indirectly, according to how the user performs different actions. Depending on this assessment, the user's knowledge of the domain is classified into different levels, which are updated over time as the user works with the system. Domain-independent (DI) data is not related to the user's expertise in the domain but to his/her cognitive abilities, which indicate how the user perceives, thinks, remembers, behaves and solves different problems [8]. In other words, domain-independent knowledge corresponds to the psychological characteristics of the user.
Many different psychological models and tests can be used to assess user personality, such as the Myers-Briggs Type Indicator [9], Eysenck's PEN Model [10] and the Big Five Model [11,12,13,14].
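As an illustration of how DD and DI data and the two acquisition methods could fit together, the following is a minimal sketch. The class name, the three knowledge levels, the minimum number of observed actions and the success-rate thresholds are all hypothetical choices made for this example; none of them is taken from the PHE project or from the cited literature.

```python
from dataclasses import dataclass, field
from typing import Dict

# Hypothetical knowledge levels and thresholds, chosen only for illustration.
LEVELS = ("novice", "intermediate", "expert")
MIN_ATTEMPTS = 3


@dataclass
class UserModel:
    # Domain-dependent (DD) data: concept -> current knowledge level.
    knowledge: Dict[str, str] = field(default_factory=dict)
    # Domain-independent (DI) data: psychological trait -> score.
    traits: Dict[str, float] = field(default_factory=dict)
    # Bookkeeping for indirect acquisition: concept -> [successes, attempts].
    _history: Dict[str, list] = field(default_factory=dict)

    def direct_dialogue(self, concept: str, level: str) -> None:
        """Store a level the user reported, e.g. through a questionnaire."""
        if level not in LEVELS:
            raise ValueError(f"unknown level: {level}")
        self.knowledge[concept] = level

    def indirect_acquisition(self, concept: str, success: bool) -> None:
        """Revise the level from how the user performed an action."""
        stats = self._history.setdefault(concept, [0, 0])
        stats[0] += int(success)
        stats[1] += 1
        if stats[1] < MIN_ATTEMPTS:
            return  # too little evidence to revise the level yet
        rate = stats[0] / stats[1]
        if rate >= 0.8:
            self.knowledge[concept] = "expert"
        elif rate >= 0.5:
            self.knowledge[concept] = "intermediate"
        else:
            self.knowledge[concept] = "novice"
```

In this sketch a questionnaire answer sets the initial level, and subsequent observed actions overwrite it once enough evidence has accumulated, which mirrors the idea that levels are updated over time as the user works with the system.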

After identifying the data related to each user's characteristics, it is possible to define the algorithms that will process this data and in turn affect the computational environment. These algorithms are mainly defined using statistical and non-statistical techniques. Notable statistical techniques include the Beta Distribution [15], Linear Modelling, Markov Models, Bayesian Networks and Rule Induction with statistical data [16]. Examples of non-statistical techniques include the Overlay Model [17, 18], Perturbation Model [17, 19], Knowledge Modelling, Behaviour-based Model [20], Rule-based Model, Stereotypes [21] and Ontologies [22,23,24,25,26].

Personal health empowerment

Two of the greatest challenges associated with traditional healthcare systems are that they are reactive and process driven [2]. This means that these systems often treat patients according to predefined pathways, with limited possibilities to consider individual necessities and capabilities. Furthermore, healthcare tends to be provided to people only when they are diagnosed with certain health issues (reactive), instead of assessing risks in time and preventing adverse health development (proactive).

The PHE project addresses these challenges by focusing on patient support, allowing the patient to better manage his/her own health-related behaviour and to participate actively in the treatment of the disease (https://itea3.org/project/personal-health-empowerment.html). Wellness-oriented solutions usually do not last very long, as patients quickly lose interest in using them in conjunction with clinical healthcare treatments. Therefore, in the current paradigm, patients are left alone with their problems in between therapy or treatment sessions, and any personal data collected is left unused. The main goal of PHE is to change this paradigm by empowering people to monitor and improve their health using personal data and technology-assisted coaching. To this end, PHE will help to provide both evidence and means to realize people-centric and preventive healthcare and to allow for cost-saving self-care and home-care solutions with increased patient involvement.

The project intends to apply innovative measurement, monitoring, heterogeneous data analysis and intelligent coaching solutions to two use cases: Healthy Workplaces and CORD Management. The first use case concerns the need to provide ideal workplace conditions from a wellbeing point of view, improving both the physical and mental health of workers. The second use case concerns CORD, which are a public health problem placing increasing demands on healthcare systems. There is a growing market demand for solutions that can help to reduce costs while maintaining the quality of care [27, 28]. Patients with CORD are continuously at risk of health deterioration and require regular medical check-ups and monitoring of their health status. Traditionally, healthcare is delivered through face-to-face interaction with clinicians. With the growing prevalence of CORD and continuous pressure from healthcare authorities and insurance companies, an increasing number of patients are being managed at home, in their own environment, and are most of the time left alone with traditional self-management materials (books, leaflets, videos, and web-based technology) [29,30,31]. To overcome the limitations of traditional healthcare systems, new solutions such as coaching solutions and mHealth technologies are now being developed. Coaching solutions appear to be an ideal platform to deliver both simple and effective self-management interventions while maintaining or improving the quality of care and reducing costs [32]. mHealth technologies for CORD should involve monitoring and managing signs and symptoms of the disease, empowering patients to recognize the early signs of exacerbations and to develop skills to better manage their disease. Several monitoring systems have been proposed for CORD management in recent years, but they show evident limitations that should be discussed.
Many existing proposals already combine different machine learning techniques to monitor the health condition of the patient [33,34,35,36,37,38] and provide personalized interactions. These systems use techniques such as fuzzy classifiers, artificial neural networks [34, 37,38,39] and reinforcement learning [40, 41], among others. However, in the context of CORD management, this monitoring is mainly performed to analyse patient data and detect respiratory diseases or respiratory complications such as exacerbations [36,37,38], rather than to understand the profile (and associated behaviours) of the patient and anticipate further complications. In other words, most existing systems react to the current health condition of the patient rather than act proactively. Another issue that may compromise the usability of such systems is that they often require multiple devices, such as a smartphone solution combined with a digital spirometer to analyse respiratory function [36,37,38], which brings additional costs to the average user. With these points in mind, the PHE project was designed to provide a solution that relies only on a smartphone and its embedded sensors to monitor and capture patient data through the application of innovative monitoring algorithms. Furthermore, in this work we present how patient data captured through the use of and interaction with the smartphone will be analysed using machine learning (as explained in the following section) to identify different user profiles. This analysis is performed not to detect whether a patient has a certain respiratory disease but to identify, through his/her health behaviours, the type of user being monitored, and then to support him/her more efficiently.

Description of data

This section presents an overview of all the user characteristics that have been considered for the CORD Management use case. Each identified characteristic is related to either DD or DI data and can be retrieved through different tools such as questionnaires and self-reported data (user input), healthcare records, clustering analysis, etc. Table 1 shows all relevant user characteristics identified for the CORD Management use case. Descriptions and examples of each identified characteristic are also provided, as well as the tools that will be used to collect that information.

Table 1 User Characteristics for the CORD Management Divided in Domain-Dependent and Domain-Independent Data

The information related to each characteristic also includes the specification of different high-level and low-level variables that characterize the user's current health condition and his/her surrounding environment. Examples of these variables are: 1) Body Mass Index, which is the value derived from the mass (weight) and height of an individual; 2) Use of Continuous Positive Airway Pressure, a form of positive airway pressure ventilation that continuously applies mild air pressure to keep the airways open in people who are able to breathe spontaneously but need help keeping their airway unobstructed; 3) Obstructive Sleep Apnea, which is caused by complete or partial obstructions of the upper airway during sleep; 4) Air Quality Index, which indicates how polluted the air currently is or how polluted it is forecast to become; etc.

Clustering analysis

A hybrid model to perform the clustering analysis for the CORD Management use case was defined by combining two different models: a hierarchical clustering model and a classification model. The architecture of the proposed hybrid model is presented in Fig. 1. It comprises a process with four main steps, which are explained in more detail in the following sections.

Fig. 1 Architecture of the Proposed Hybrid Model Divided in 4 Main Steps (Data Pre-Processing, Hierarchical Clustering, Cluster Validation, and Interpretation of Results with the Classification Model)

Data description

The proposed hybrid model will be validated using data obtained from the Control and Burden of Asthma and Rhinitis (ICAR) study (PTDC/SAU-SAP/119192/2010), a nationwide, population-based, observational cross-sectional study conducted in Portugal (ClinicalTrials.gov: NCT01771120) [42]. Included participants (n=726) were from the general population and aged 18 years and older. The mean age of the participants was 44 years, and 63% (n=469) of the participants were female. For each patient, data were collected on lung function and exhaled nitric oxide, skin prick tests, a structured clinical assessment, and standardized questionnaires. The data collected comprised a total of 1181 variables described in different data formats (numeric, categorical, binary, etc.).

Data pre-processing

The first step of the clustering analysis performed in this study was the pre-processing of the data from the ICAR study. To fit the data into the context of the CORD Management use case, the pre-processing activities focused on two steps: data filtering and categorization. In the first step, manual data filtering was performed to exclude noisy data, i.e., the variables of the ICAR study that were not considered for the CORD Management use case were excluded. On the one hand, the CORD Management use case comprises a total of 253 variables, which are described in Table 1. On the other hand, as mentioned above, the ICAR study collected patient data on 1181 variables. The filtering consisted of (1) selecting the ICAR variables that are also considered in the CORD Management use case, which yielded a total of 96 independent variables, and (2) excluding, among those, any variable that took only a single value. This left 93 variables for further analysis. After identifying these variables, the second step of data pre-processing was performed by categorizing each variable according to established categories in the medical literature. For instance, the GA2LEN Score [42] provides established thresholds that describe the probability of a patient having asthma: a score of 0 suggests the patient is unlikely to have asthma, a score between 1 and 3 suggests the patient may have asthma and further diagnosis is required, and a score of 4 or higher suggests the patient is very likely to have asthma.
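The two pre-processing steps can be sketched as follows. This is only an illustration under stated assumptions: in the study the filtering was performed manually, and the function names, the record layout (one dictionary per patient) and the category labels are hypothetical. Only the GA2LEN thresholds (0, 1 to 3, 4 or higher) come from the description above.

```python
from typing import Dict, List, Optional, Sequence

Record = Dict[str, Optional[object]]


def shared_variables(study_vars: Sequence[str],
                     use_case_vars: Sequence[str]) -> List[str]:
    """Filtering step (1): keep study variables also used by the use case."""
    wanted = set(use_case_vars)
    return [v for v in study_vars if v in wanted]


def drop_single_valued(records: Sequence[Record],
                       variables: Sequence[str]) -> List[str]:
    """Filtering step (2): drop variables that take only one distinct value."""
    kept = []
    for var in variables:
        # Missing values (None) are ignored when counting distinct values.
        values = {r.get(var) for r in records if r.get(var) is not None}
        if len(values) > 1:
            kept.append(var)
    return kept


def categorize_ga2len(score: int) -> str:
    """Categorization step, using the thresholds quoted in the text.

    The returned labels are hypothetical names for the three categories.
    """
    if score < 0:
        raise ValueError("score must be non-negative")
    if score == 0:
        return "unlikely"
    if score <= 3:
        return "possible"
    return "very likely"
```

Applied to the ICAR data, the first function would correspond to the reduction from 1181 to 96 variables and the second to the reduction from 96 to 93.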

Hierarchical clustering

After pre-processing, hierarchical clustering was applied to the dataset to identify different clusters. First, it was necessary to define a dissimilarity matrix in which the distance between different points (i.e. different patients) is measured using a distance function. For this study, the dissimilarity matrix was defined using the Gower distance [43], which can handle categorical data and measures the distance between two instances $X_i$ and $X_j$ differently according to each variable considered. The following formula is applied:

$$ s_{ij}=\frac{\sum_{k=1}^{N} w_{ijk}\, s_{ijk}}{\sum_{k=1}^{N} w_{ijk}} $$

where $w_{ijk}$ is the weight of variable $k$ for the pair of instances $X_i$ and $X_j$, $s_{ijk}$ is the difference between the values of variable $k$ for the two instances, and $N$ is the number of variables.
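A minimal sketch of this computation is given below. The text does not state how the per-variable terms and weights were defined, so the sketch assumes the commonly used convention: for numeric variables the absolute difference divided by the observed range, for categorical variables 0 if the values match and 1 otherwise, and a weight of 1 when both values are present and 0 when either is missing. The function names are hypothetical.

```python
from typing import List, Sequence


def gower_distance(xi: Sequence, xj: Sequence,
                   is_numeric: Sequence[bool],
                   ranges: Sequence[float]) -> float:
    """Weighted mean of per-variable differences between two instances."""
    num = 0.0  # sum over k of w_ijk * s_ijk
    den = 0.0  # sum over k of w_ijk
    for k, (a, b) in enumerate(zip(xi, xj)):
        if a is None or b is None:
            continue  # assumed convention: w_ijk = 0 for missing values
        if is_numeric[k]:
            s = abs(a - b) / ranges[k] if ranges[k] > 0 else 0.0
        else:
            s = 0.0 if a == b else 1.0
        num += s
        den += 1.0
    if den == 0.0:
        raise ValueError("instances share no observed variables")
    return num / den


def dissimilarity_matrix(rows: Sequence[Sequence],
                         is_numeric: Sequence[bool]) -> List[List[float]]:
    """Symmetric matrix of pairwise Gower distances between all rows."""
    ranges = []
    for k, numeric in enumerate(is_numeric):
        vals = [r[k] for r in rows if numeric and r[k] is not None]
        ranges.append(max(vals) - min(vals) if vals else 0.0)
    n = len(rows)
    D = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            D[i][j] = D[j][i] = gower_distance(rows[i], rows[j],
                                               is_numeric, ranges)
    return D
```

For example, with one numeric and one categorical variable, the rows `[20.0, "yes"]` and `[30.0, "no"]` in a dataset whose numeric range is 20 are at distance (0.5 + 1) / 2 = 0.75.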

After defining the dissimilarity matrix, hierarchical clustering can be performed. The two most common hierarchical clustering approaches are agglomerative and divisive clustering. Agglomerative clustering (AC) is a bottom-up approach in which each instance is first treated as a single cluster and pairs of clusters are then merged successively until one cluster containing all instances is obtained. Divisive clustering (DC) is a top-down approach in which a single cluster containing all instances is defined first and the most heterogeneous cluster is then divided iteratively until each instance is its own cluster. The resulting dendrograms, containing all clusters formed using the agglomerative and divisive approaches, are presented in Figs. 2 and 3, respectively.
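The bottom-up procedure can be sketched on a precomputed dissimilarity matrix as follows. The linkage criterion used in the study is not stated; this sketch assumes average linkage and stops merging once a requested number of clusters remains, instead of building the full dendrogram. It is a naive implementation intended for illustration only, and the function name is hypothetical.

```python
from typing import List, Sequence


def agglomerative(D: Sequence[Sequence[float]], n_clusters: int) -> List[int]:
    """Merge the two closest clusters until n_clusters remain.

    D is a symmetric dissimilarity matrix. Average linkage is an
    assumption of this sketch, not a detail taken from the study.
    """
    n = len(D)
    if not 1 <= n_clusters <= n:
        raise ValueError("n_clusters must be between 1 and len(D)")
    clusters = [[i] for i in range(n)]  # every instance starts alone
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Average pairwise distance between the two clusters.
                total = sum(D[i][j] for i in clusters[a] for j in clusters[b])
                dist = total / (len(clusters[a]) * len(clusters[b]))
                if best is None or dist < best[0]:
                    best = (dist, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    labels = [0] * n
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels
```

Divisive clustering works in the opposite direction and is not shown here.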

Fig. 2 Dendrogram generated using Agglomerative Clustering for the CORD Management Use Case

Fig. 3 Dendrogram generated using Divisive Clustering for the CORD Management Use Case

Observing the two generated dendrograms, the identification of the algorithm that better suits the data was not clear, and an additional step was required. The cluster validation and the identification of the ideal number of clusters and the most suitable clustering algorithm are explained in the following section.

Cluster validation

In the cluster validation phase, each algorithm and its generated clusters were validated in terms of size and distance between clusters, based on the within-cluster sum of squares. This sum measures how close the instances are within a cluster: the lower the within-cluster sum of squares, the closer the instances are within the clusters. To visualize this, an elbow curve was created for each clustering algorithm, showing the within-cluster sum of squares against the number of generated clusters. These graphs are shown in Figs. 4 and 5.
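The within-cluster sum of squares can be computed directly from a dissimilarity matrix, as in the sketch below. It assumes the pairwise form, in which the contribution of a cluster is the sum of squared pairwise dissimilarities divided by the cluster size; for Euclidean distances this equals the sum of squared distances to the cluster centroid. Whether the study used exactly this form is not stated, and the function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Sequence


def within_ss(D: Sequence[Sequence[float]], labels: Sequence[int]) -> float:
    """Total within-cluster sum of squares for one cluster assignment."""
    members = defaultdict(list)
    for idx, label in enumerate(labels):
        members[label].append(idx)
    total = 0.0
    for idxs in members.values():
        # Sum of squared dissimilarities over unordered pairs in the cluster.
        pair_sum = sum(D[i][j] ** 2
                       for pos, i in enumerate(idxs)
                       for j in idxs[pos + 1:])
        total += pair_sum / len(idxs)
    return total


def elbow_curve(D: Sequence[Sequence[float]],
                labelings: Dict[int, Sequence[int]]) -> Dict[int, float]:
    """Map each number of clusters k to its within-cluster sum of squares.

    labelings maps k to the cluster labels obtained when cutting at k.
    """
    return {k: within_ss(D, labels) for k, labels in sorted(labelings.items())}
```

Plotting the returned values against k gives the elbow curve; a bend indicates a number of clusters beyond which adding more clusters brings little further reduction.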

Fig. 4 Elbow Curve for Agglomerative Clustering (Bends Observed at 5 and 10 Clusters)

Fig. 5 Elbow Curve for Divisive Clustering (Bends Observed at 3 and 10 Clusters)

In the case of AC, a bend is observed at 5 clusters, followed by a significant decrease of the within-cluster sum of squares at 10 clusters. For DC, a bend is observed at 3 clusters, also followed by a significant decrease of the within-cluster sum of squares at 10 clusters. To get a better idea of what a good number of clusters is and which clustering method should be used, the generated clusters should also be analysed in terms of size.

Looking at Figs. 6 and 7, the observations seem similarly balanced for both approaches. With a small number of clusters, both approaches distribute the observations fairly evenly. DC shows a gap from 360 observations in Cluster-2 to 99 observations in Cluster-3, while AC shows a gap from 329 observations in Cluster-2 to 28 observations in another cluster. If the number of clusters increases, DC produces two clusters with very few observations (Cluster-9 with 4 observations and Cluster-10 with only 1 observation), while AC produces clusters with only a few more observations (the smallest is Cluster-7 with 9 observations). Imbalanced clusters could lead to more biased comparisons, as clusters with more instances could outweigh the remaining clusters [44]. For this reason, both approaches seem more adequate with a smaller number of clusters. Furthermore, since there is no clear difference between the two approaches at these smaller numbers (3 clusters for DC and 5 clusters for AC), 5 clusters using AC were chosen for the remainder of this analysis, given the large number of independent variables considered after the data pre-processing phase.

Fig. 6 Cluster Size for Divisive Clustering (Size of Each Cluster Highlighted for 3 and 10 Clusters Division)

Fig. 7 Cluster Size for Agglomerative Clustering (Size of Each Cluster Highlighted for 5 and 10 Clusters Division)

Classification model

After identifying and generating the ideal number of clusters, the clusters were classified using the C5.0 decision-tree algorithm. This algorithm makes it possible to identify, among the 93 independent variables included in this study, those with the most weight in assigning each instance to the defined clusters. Cross-validation was necessary to avoid both over-fitting and under-fitting the decision model; this process consisted of training and validating on the filtered data from the ICAR study over 10 iterations. In each iteration, 66% of the data was used to train the decision model while 33% was used to validate it. Among these 10 cross-validation iterations, the classification model with the highest accuracy was selected.
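The validation loop described above can be sketched as repeated random train/validation splits. Two assumptions should be noted. First, the splits are assumed to be random and independent across iterations, which the text does not state explicitly. Second, C5.0 is not implemented here: a simple one-rule learner, which predicts the cluster from the single most predictive variable, is used as a stand-in so that the example stays self-contained. All function names are hypothetical.

```python
import random
from collections import Counter, defaultdict
from typing import Callable, List, Sequence, Tuple

Row = Sequence[str]
Model = Callable[[Row], int]


def fit_one_rule(rows: Sequence[Row], labels: Sequence[int]) -> Model:
    """Stand-in learner (NOT C5.0): use the single most predictive variable."""
    default = Counter(labels).most_common(1)[0][0]
    best_rule, best_var, best_hits = {}, 0, -1
    for var in range(len(rows[0])):
        by_value = defaultdict(Counter)
        for row, y in zip(rows, labels):
            by_value[row[var]][y] += 1
        # Map each observed value to its most frequent cluster label.
        rule = {v: c.most_common(1)[0][0] for v, c in by_value.items()}
        hits = sum(rule[row[var]] == y for row, y in zip(rows, labels))
        if hits > best_hits:
            best_rule, best_var, best_hits = rule, var, hits
    return lambda row: best_rule.get(row[best_var], default)


def accuracy(model: Model, rows: Sequence[Row], labels: Sequence[int]) -> float:
    return sum(model(r) == y for r, y in zip(rows, labels)) / len(rows)


def repeated_holdout(rows: Sequence[Row], labels: Sequence[int],
                     fit: Callable[[Sequence[Row], Sequence[int]], Model],
                     n_iter: int = 10, train_frac: float = 0.66,
                     seed: int = 0) -> List[Tuple[float, Model]]:
    """Train and validate n_iter times on independent random splits."""
    rng = random.Random(seed)
    n_train = int(round(train_frac * len(rows)))
    results = []
    for _ in range(n_iter):
        idx = list(range(len(rows)))
        rng.shuffle(idx)
        train, valid = idx[:n_train], idx[n_train:]
        model = fit([rows[i] for i in train], [labels[i] for i in train])
        acc = accuracy(model, [rows[i] for i in valid],
                       [labels[i] for i in valid])
        results.append((acc, model))
    return results
```

The model with the highest validation accuracy can then be selected with `max(results, key=lambda r: r[0])`. Any learner with the same `fit(rows, labels)` interface, including a real decision-tree implementation, could be passed in place of the stand-in.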

According to Fig. 8, the classification model with the highest accuracy (91.4%) was the one obtained in the first iteration. The iterations with the next-highest accuracy were iteration 10 (89.4%) and iteration 3 (87.3%).

Fig. 8 Cross-validation Analysis (Highest Accuracy Values Correspond to Iterations 1, 3 and 10)

As can be seen in Table 2, the cluster division in iteration 1 used 7 of the 93 independent variables and produced 7 classification rules (Table 3) that describe the characteristics of each cluster. Among the iterations with the highest accuracy (iterations 1, 3 and 10), the most used variables were "Number of inhalers", "Sensitization to at least one indoor allergen" and "At least one factor for asthma related death", all of which were used in the cluster division in each of these iterations. The next most used variables were "Number of asthma exacerbations on the previous year" (iterations 3 and 10) and "Bronchodilator reversibility based on previous spirometry" (iterations 1 and 10). In fact, all variables considered in iteration 10 were also considered in either iteration 1 or iteration 3. Furthermore, iteration 3 used the highest number of variables in the cluster division (13 of the 93 independent variables).

Table 2 Independent Variables Usage for Iteration 1, 3 and 10
Table 3 Classification Rules Generated for Iteration 1

The results obtained in the first iteration, after generating the decision tree, reveal the importance of the variables in assigning each instance to a cluster. The variable with the highest usage percentage was the use (or not) of inhalers. Within the group of patients who used inhalers (clusters 3 to 5), the most relevant variable was the presence (or not) of at least one risk factor for asthma-related death. Cluster 3 is characterized by at least one risk factor for asthma-related death, including a severe asthma exacerbation in the last year; cluster 5 by at least one risk factor for asthma-related death but no severe asthma exacerbation in the last year; and cluster 4 by neither of these two variables. The group of patients who did not use inhalers is further characterized by sensitization to at least one indoor allergen (cluster 1) and by bronchodilator reversibility based on previous spirometry (cluster 2).

The cluster distribution seems to suggest that cluster 1 groups patients with allergies but without any respiratory disease; cluster 2 groups patients who, although they probably have asthma or COPD, do not use inhalers to control their disease; cluster 3 groups patients with moderate to severe rhinitis and asthma that is not controlled despite the use of inhalers; cluster 4 groups patients who use inhalers for respiratory disease and have no exacerbations; and cluster 5 groups patients with mild rhinitis and asthma that is not controlled despite the use of inhalers.

Each cluster obtained will allow us to define a specific user profile within the scope of the CORD Management use case, which in turn makes it possible to provide content adjusted to the patient based on his/her profile. This includes personalized recommendations and coaching plans targeted at specific user profiles.

Conclusions and future work

CORD are a public health problem placing increasing demands on healthcare systems. There is a growing market demand for solutions that can help to reduce costs while maintaining the quality of care. Patients with CORD are continuously at risk of health deterioration and require regular medical check-ups and monitoring of their health status. Traditionally, healthcare is delivered through face-to-face interaction with clinicians. With the growing prevalence of CORD and continuous pressure from healthcare authorities and insurance companies, an increasing number of patients are being managed at home, in their own environment, and are most of the time left alone with traditional self-management materials (books, leaflets, videos, and web-based technology). To overcome the limitations of traditional healthcare systems, new solutions such as coaching solutions and mHealth technologies are now being developed. Coaching solutions appear to be an ideal platform to deliver both simple and effective self-management interventions while maintaining or improving the quality of care and reducing costs. mHealth technologies for CORD should involve monitoring and managing signs and symptoms of the disease, empowering patients to recognize the early signs of exacerbations and to develop skills to be active participants in the management and treatment of their disease.

This paper presented an overview of user modelling, including the characteristics and techniques most frequently used to model different users. The PHE project was presented, and the clustering analysis used to identify different clusters of users for the CORD Management use case was defined. The required steps were described, and it was possible to identify 7 different rules that describe a total of 5 clusters of users within the CORD Management use case. The clusters obtained will be essential in the development of the PHE healthcare solution, as they allow a more personalized coaching solution to be built, one that adapts all recommendations to the characteristics of each user and ultimately makes it possible to improve the user's current health condition and promote a healthier lifestyle.

As future work, we intend to validate and test the personalized solutions that will be developed, starting with single-visit validation in a clinical environment and ending with observational validation in a community environment. Ultimately, the goal will be to assess the potential of self-monitoring and personalization for reducing exacerbations and symptoms when used in between healthcare appointments. We believe that the data generated during the validation period will also create new opportunities for research and for the development of further innovative healthcare services to better support patients suffering from chronic obstructive respiratory diseases.