Human activity recognition based on multi-instance learning

Human activity recognition (HAR) is the process of classifying a person's actions, and it is an essential task for many human-centered applications. Multi-instance learning (MIL) is a special case of machine learning where the training examples are bags containing many instances, and a single class label is assigned to an entire bag of instances. In this study, we integrated these two concepts by introducing a novel approach, "human activity recognition based on multi-instance learning", called HAR-MIL. Unlike previous studies, the proposed HAR-MIL method represents human activities differently: as a bag of various wearable sensors (gyroscope, magnetometer, accelerometer, and linear acceleration). HAR-MIL presents an applicable and flexible model by providing multi-instance representation and eliminating the restrictions of traditional single-instance representation. Therefore, the adverse effects of missing data, defective sensors, and biased measurements on activity classification performance were minimized. This study is the first to investigate the performance of two MIL algorithms (SimpleMI and MIWrapper) on HAR. We explored the effect of four main factors (sensor positions, sensor types, base learners, and single or multiple participants) on the multi-instance representation of relevant human daily activities. The effectiveness of the proposed HAR-MIL method was demonstrated on 50 participant-based and sensor-position-based activity recognition datasets. The experimental results showed that HAR-MIL is effective for wearable sensor-based HAR, with high classification accuracy (99.32%). Furthermore, the results showed that the proposed method outperformed the state-of-the-art methods by 10% on average on the same dataset.

In HAR, ML algorithms are usually applied to observations collected by cameras (Supriyatna et al., 2019), radars (Li et al., 2019), or wearable sensors (Daş & Birant, 2021; Kerdjidj et al., 2020; Tian et al., 2019; Wang et al., 2019). Camera-based HAR approaches have some drawbacks, such as requiring sufficient lighting, covering only a limited area, being expensive, and potentially breaching personal privacy. Radar-based HAR approaches have a limited operating range and are generally used in military and security operations. Wearable sensor-based HAR approaches focus on action data gathered by a single sensor or a set of sensors. Wearable sensors are becoming more popular as sensor technology advances. These sensors are directly or indirectly installed on the human body. Typical examples are accelerometers, global positioning system sensors, gyroscopes, microphones, magnetometers, and biosensors. These sensors can be placed not only on watches, goggles, belts, hats, shoes, or smartphones but also on electromyography (EMG) devices composed of surface electrodes. Body temperature, pulse, speech, posture, and motion data can be collected using such sensors.
Using such sensors is not only useful for HAR but also efficient for other activity recognition problems, such as animal activity recognition (Birant & Yalniz, 2022). In this study, we used data collected from five smartphones placed on five body positions (left pocket, right pocket, belt, upper-arm, and wrist) with four types of sensors (accelerometer, linear acceleration, gyroscope, and magnetometer).
Generally, using raw sensor data directly in an intelligent model is not functional because it lacks significant information to specify and identify human activities. Thus, in this study, several data preparation methods were used to obtain relevant information from raw data, including signal filtering, signal segmentation based on the sliding window, and feature extraction.
Multi-instance learning (MIL) is defined as a type of supervised ML in which a set of related training instances is grouped into a so-called bag, and a single class label is assigned to each bag. The purpose of MIL is to construct a classifier from a set of labelled training bags and then use the classifier to predict the labels of new (unseen) bags. In binary multi-instance classification, the label is set as positive if a bag has at least one positive instance, whereas if all instances in a bag are negative, the bag is labelled negative.
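The binary bag-labelling rule above can be sketched as follows (a toy illustration only; the instance-level labels are assumed to be known here purely for demonstration, whereas in real MIL they are unobserved):

```python
# Standard (binary) multi-instance assumption:
# a bag is positive if at least one of its instances is positive,
# and negative only if every instance is negative.

def bag_label(instance_labels):
    """Return 'positive' if any instance in the bag is positive, else 'negative'."""
    return "positive" if any(label == 1 for label in instance_labels) else "negative"

bag_a = [0, 0, 1, 0]  # contains one positive instance -> positive bag
bag_b = [0, 0, 0]     # all instances negative -> negative bag

print(bag_label(bag_a))  # positive
print(bag_label(bag_b))  # negative
```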
Several techniques for solving MIL problems have been discussed in the literature, one example being the conversion of bags into a standard single-instance representation. The foremost algorithms of this kind are SimpleMI and MIWrapper, because many others are only suitable for binary multi-instance (MI) classification. Therefore, in this study, the SimpleMI and MIWrapper methods were each used with different classification algorithms as base learners, including decision tree (DT), k-nearest neighbours (KNN), neural networks (NN), the partial decision tree algorithm (PART), support vector machines (SVM), AdaBoost (AB), and random forest (RF).
This study makes six significant contributions. (i) This study combines the two concepts (HAR and MIL) in a new way and proposes a novel approach, called HAR-MIL, in which human activity is represented as a bag of different wearable sensors, including an accelerometer (A), gyroscope (G), magnetometer (M), and linear acceleration (LA). (ii) This study investigates the optimal combination of MIL algorithms (SimpleMI and MIWrapper) and base classifiers (DT, SVM, NN, PART, KNN, RF, AB) for wearable sensor-based HAR problems. (iii) To the best of our knowledge, no MIL-based HAR study answers the question of which sensor location (left pocket, right pocket, belt, upper arm, or wrist) provides more meaningful and key data to represent and recognize relevant daily activities. (iv) This is the first study to investigate different sensor configurations (A + G + M + LA, A + G + M, A + G, A + M, and G + M) in MIL-based HAR to determine which sensor configuration in a MI problem captures the relevant activity more accurately. (v) We also explore and compare the performance of MIL-based HAR for each participant individually (personalized classifiers) and for all participants jointly (generalized classifier). (vi) The proposed method achieved higher performance (10% improvement on average) than the state-of-the-art methods on the same dataset.
This study specifically answers the following important research questions:
1. How can participant datasets be used to improve MIL-based HAR, either individually or jointly?
2. What is the key combination of MIL algorithms and base classifiers for the best classification performance?
3. Which sensor type and position provide more useful data for multi-instance-based daily activity recognition?
4. Which combination of wearable sensors performs the key role for MIL-based HAR?
In the experimental studies, four experiments were conducted to investigate the impacts of participants, classification algorithms, sensor locations, and sensor types on MIL-based HAR for the first time. The proposed approach was assessed on 50 real-world activity recognition datasets, and the experimental results showed that the proposed HAR-MIL method effectively recognizes seven different physical human activities, including walking, standing, jogging, sitting, biking, upstairs, and downstairs.
This study is structured as follows. Section 2 presents the related literature and previous studies on HAR and MIL. Section 3 provides the background information about the HAR, MIL, and base learners used in this study. This section also explains the proposed approach with details of each step and the algorithm. Section 4 presents a short explanation of the dataset and provides the experimental results. Finally, concluding remarks and future work are presented in Section 5.
A HAR system framework mainly includes the following steps: data acquisition, data pre-processing, signal segmentation, feature extraction and selection, training, model evaluation, and classification. Some earlier studies primarily focused on this issue because the feature extraction phase plays an essential role in the performance of the final system. For example, Attal et al. (2015) compared three different unsupervised learning techniques with four different supervised learning techniques in terms of precision, recall, specificity, and F-measure. Their results showed that the extracted features better characterized the activities. Similarly, Shoaib et al. (2014) considered four different feature sets to achieve more successful activity recognition results. In the literature, most HAR systems (Carpineti et al., 2018; Zokas & Lukosevicius, 2018) use the classification method as a supervised learning technique, even though other types of ML techniques, such as semi-supervised and unsupervised learning (Machado et al., 2015; Yassine et al., 2017), have also been successfully used to provide information about human activities from raw sensor data.
Various sensor types have been used in the design of HAR systems, such as wearable, vision-based, and environmental sensors.
In this study, we used wearable sensors because they provide many advantages, such as a wide coverage area, privacy protection, and high robustness, when modelling activity recognizers. Wearable sensors can be effortlessly attached to various body parts (waist, arm, leg, wrist, etc.) depending on the human activities being studied. Mukherjee et al. (2017) explored the optimal sensor location for a group of human activities and the most relevant sensor types for classifying different activities. Similarly, Shoaib et al. (2014) explored different phone locations to examine whether changing smartphone positions enhances activity recognition performance. Table 1 shows that the most commonly used sensors for activity recognition are the gyroscope (G) and accelerometer (A), while the rarely used ones are the magnetometer (M) and linear acceleration (LA). Accelerometer and gyroscope sensors are prevalent choices for measuring position or movement because of their low cost, light weight (a few grams), small size (a few mm), and ease of programming, and because they can be easily attached to the body using either an adhesive or a strap. Tian et al. (2019) presented a two-layer multiclassifier method for HAR using a single wearable accelerometer. In our study, we used four different sensors (A, G, M, and LA) to improve human activity recognition performance. However, the main idea here is to use these sensors jointly in a bag to define a human action.
Currently, most HAR applications have been developed for smartphones because phones include several sensors (e.g., accelerometer and gyroscope) and provide useful, wearable, easy-to-use computing platforms with computing power and storage. For instance, Zokas and Lukosevicius (2018) constructed a recognition system to recognize certain types of human activities using gyroscope and acceleration data collected with a smartphone. Nowadays, a few HAR studies focus on activity recognition on different devices, such as biological devices or smartwatches (Ehatisham-Ul-Haq et al., 2020; Weiss et al., 2019). Some HAR applications use inertial sensors, while others use external sensors, such as health sensors (Kantoch, 2017) or environmental sensors.
Gait analysis is an important branch of HAR. Calculation of joint angles from sensor data provides a successful solution to recognize gait patterns because human gait is the manifestation of change in the joint angles of the hip, knee, and ankle (Bijalwan et al., 2021; Gao et al., 2019).

TABLE 1 Comparison of this study with the previous studies.

In contrast to these existing classification types, a different type, called multi-instance classification, was proposed for HAR in this study. Here, an entire human action is represented as a bag of instances, where each instance is a feature vector extracted from different sensor data obtained at a certain time.
In this study, we do not consider shallow (single-instance) classification algorithms, as most HAR systems do. Instead, we used a more sophisticated but practical human activity recognizer based on multi-instance learning (HAR-MIL).

| Literature review for multiple instance learning
Recently, MIL has been used in various areas, including health (Yousefi et al., 2018), bioinformatics (Bandyopadhyay et al., 2015), education (Ma et al., 2019), and security (Stiborek et al., 2018). Quellec et al. (2017) reviewed existing studies that treat medical image and video analysis as MIL problems in terms of modelling strategies. Campanella et al. (2018) used deep MIL for classification and localization in pathology.
The algorithms proposed for MIL can be grouped into three categories. (i) Some new algorithms were proposed particularly for MI problems, such as multi-instance diverse density (MIDD). (ii) Some standard learning algorithms were adapted to deal with MI data, such as MISVM (multi-instance SVM) and MILR (multi-instance logistic regression). (iii) Instead of adapting the algorithm, the data were transformed into a standard single-instance representation, as in MIWrapper (Frank & Xu, 2003) and SimpleMI (Dong, 2006). In this study, we used the algorithms in the last category because they are suitable for both binary-class and multi-class classification.
To the best of our knowledge, MIL-based HAR has never been considered in detail; only a few studies (Saha et al., 2021; Mosabbeb et al., 2015; Stikic & Schiele, 2009) considered both concepts together, but in different ways. Unlike previous studies, the proposed HAR-MIL method represents the human activity phenomenon differently: as a bag of various wearable sensors (A, G, M, LA). Because MIL-based HAR has received little attention in the literature, we performed a comprehensive study, comprising the effects of four main aspects (sensor types, sensor positions, base learners, and single-multiple participants) on HAR-MIL for the first time.

| MATERIALS AND METHODS
This section includes background information about the methods used in this study and explains the proposed approach, "human activity recognition based on multi-instance learning" called HAR-MIL.

| Human activity recognition
HAR is defined as the task of correctly detecting human actions based on observations gathered by cameras or sensors. Initially, researchers focused on human activity classification from images and videos, but as sensor technology advanced, they began investigating activity recognition using sensors. Sensor-based HAR can be categorized into two groups: wearable sensor-based HAR (WSHAR) and ambient sensor-based HAR (ASHAR).
ASHAR applications sense activities from fixed sensors that are embedded in an environment (i.e., home, factory) or attached to some specific fixed object in a predetermined point of interest, such as a door, fridge, wall, and so on (Yassine et al., 2017). Typical examples are light, reed switch, temperature, flow, pressure, and vibration sensors, RFID tags, infrared sensors, video cameras, and microphones. ASHAR systems are particularly suitable for controlling security (i.e., intrusion detection) and for meeting the care demands of older people living at home. However, they only work with predetermined infrastructures in a limited area where the fixed sensors are placed. WSHAR applications, an alternative to ASHAR systems, identify activities by analysing raw data collected from sensors within wearable smart gadgets, such as smartphones, smart glasses, and smartwatches. These enable numerous applications because they cover a relatively large space. In this study, we focus on wearable sensor-based HAR because of its popularity.

| Multiple instance learning
In MIL, each learning sample is defined as a bag that contains multiple instances. Besides its feature values, the sole information about an instance is its membership relationship to a bag (Frank & Xu, 2003). MIL differs from traditional (single-instance) learning in that it learns from a multiple instance dataset that includes bags of instances. MIL can be mathematically expressed as follows. Let χ be the instance space. The dataset D includes a set of bags and is denoted by D = {(B_j, y_j)}, j = 1, …, n, where each bag B_j = {x_1j, x_2j, …, x_tj} ⊆ χ, t is the number of instances in the bag, and n is the number of training bags. The instance x_kj indicates the kth instance in the bag B_j. The classification task occurs at the bag level.
A multiple instance classifier takes an unseen bag as input and predicts its class label. Although several multiple instance learners exist, in this study SimpleMI and MIWrapper, each combined with seven different classification algorithms, were individually used on the HAR datasets.
SimpleMI: SimpleMI is defined as the application of traditional learning algorithms to MIL problems (Dong, 2006).
The basic idea behind this algorithm is to summarize each bag and build a single instance representing the entire bag. In this procedure, the entire MI dataset is transformed into a single-instance dataset. Therefore, a base learner can be executed without restriction. The class label is predicted using the single-instance model directly. Three alternative summarization techniques can be employed in SimpleMI (Dong, 2006). The first method is the arithmetic mean, which evaluates the average of each attribute for each bag. In this method, a single instance is considered to represent the entire bag. Generally, the mean (M) of the instances of each bag is calculated to compress the multiple instances into their average, as shown in Equation (1):

M = (1/t) Σ_{k=1}^{t} x_k (1)

Following that, the algorithm creates a new instance where the value of each attribute is equated to the mean.
Another summarization technique is the geometric mean, which evaluates the geometric mean of each attribute for each bag. The values of each attribute are summarized by multiplication over the instances of the bag, followed by taking the t-th root of the product, where t is the number of instances, as shown in Equation (2):

m_gi = (Π_{k=1}^{t} x_ki)^{1/t} (2)

If m_g1 is the mean value of the first dimension, the summarization of a bag can be represented using a new instance (m_g1, m_g2, m_g3, …, m_gn), where n is the number of dimensions.
The final option is the minimax method, where the minimum and maximum values of each dimension are considered for each bag. Therefore, the summarization is represented as (min x_1, min x_2, min x_3, …, min x_n, max x_1, max x_2, max x_3, …, max x_n). The arithmetic mean approach was used for SimpleMI in this study.

MIWrapper: MIWrapper is a multiple instance classification method that relies on two major concepts that provide good performance (Frank & Xu, 2003). The first concept is an appropriate weighting scheme, in which each instance is weighted in line with the size of the corresponding bag. This ensures that each bag is treated equally. The weight of the kth instance (w_k) in a bag is set as shown in Equation (3):

w_k = p_x / (P_B · m_k) (3)

where p_x is the total number of instances, P_B is the number of bags, and m_k is the number of instances in the kth bag. MIWrapper uses the inheritance rule given in Equation (4):

LR(X) = (1/|X|) Σ_{x ∈ X} lr(x) (4)

where lr(x) is the instance-based learner and LR(X) is the bag-based learner.
The second concept considers the collective assumption based on diverse density, instead of the standard multiple instance assumption (Frank & Xu, 2003). The collective assumption states that all instances of a bag contribute independently and equally to the class label of the bag. Therefore, the probabilities of the instances of a bag can be averaged. The bag-level probability (P_LR) for a positive bag B_j^+ is given in Equation (5), and it can be specified analogously for a negative bag B_j^- in Equation (6):

P_LR(B_j^+) = (1/X) Σ_{k=1}^{X} Pr(t | x_k) (5)

P_LR(B_j^-) = (1/X) Σ_{k=1}^{X} (1 − Pr(t | x_k)) (6)

where X is the number of instances in the jth bag, x is a single point in the feature space, and t is the true concept (Dong, 2006). The candidate for the true concept is determined by measuring its closeness to different positive bags: the closer the candidate is to more positive bags, the higher the achieved likelihood. To measure the likelihood, the probabilistic measure of diverse density (D_d) was proposed (Maron & Lozano-Pérez, 1997), as shown in Equation (7):

D_d(t) = Π_j P_LR(B_j^+) · Π_j P_LR(B_j^-) (7)

Because the goal is to maximize the likelihood, a general mathematical expression of the maximum diverse density is evaluated as shown in Equation (8):

t* = argmax_t Π_j P_LR(B_j^+) · Π_j P_LR(B_j^-) (8)

The candidate point must be close to the positive bags while being as far as possible from the negative bags. The first product in Equation (8) measures the candidate's closeness to the positive bags, whereas the second product measures how far the candidate is from the negative bags.
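The two bag-handling strategies can be sketched as follows (a minimal illustration, not the actual WEKA implementations; `toy_proba` is a hypothetical stand-in for a trained probabilistic base learner): SimpleMI collapses each bag into its arithmetic mean, while an MIWrapper-style prediction averages the instance-level class probabilities over the bag.

```python
import numpy as np

def simplemi_mean(bag):
    """SimpleMI (arithmetic mean): collapse a bag into one instance
    by averaging each attribute over the bag's instances."""
    return np.asarray(bag).mean(axis=0)

def miwrapper_predict(bag, instance_proba):
    """MIWrapper-style prediction: average the instance-level class
    probability estimates over the bag (collective assumption)."""
    probs = np.array([instance_proba(x) for x in bag])
    return probs.mean(axis=0)

# Toy bag with three instances and two features.
bag = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(simplemi_mean(bag))  # [3. 4.]

# Hypothetical instance-level classifier; in practice any probabilistic
# base learner (e.g., RF or NN) trained on the instances would be used.
toy_proba = lambda x: np.array([0.8, 0.2]) if x[0] < 4 else np.array([0.2, 0.8])
print(miwrapper_predict(bag, toy_proba))  # averaged class distribution over the bag
```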

| Proposed approach (HAR-MIL)
In traditional HAR systems, an activity label is assigned to a single feature vector, which is extracted from one or more sensor data streams. However, when multiple sensors are attached to a human body to recognize an activity, they are jointly sampled, and each sensor is responsible for measuring a specific type of data. Because they are strongly related to each other, their data can be unified in a MI manner to reach a final consensus. Therefore, multi-sensors formed as multi-instances can significantly improve recognition. Based on this motivation, we proposed a novel approach called HAR-MIL in this study. Our approach integrates multiple sensors in a unified manner by utilizing an activity-based MIL framework. The multiple sensor data of an activity are considered a bag of instances, where each instance is a feature vector extracted from one sensor's data at a particular time. Figure 1 shows the flowchart of the HAR-MIL approach, including data acquisition, pre-processing, signal segmentation, feature extraction, MIL representation, training, performance evaluation, activity classification, and decision support. In the data acquisition step, raw data were obtained at a particular sampling rate from multiple types of sensors (A, G, M, and LA) placed on a wearable device (i.e., smartphones, smart glasses, smartwatches, and smart shoes) during human action. After that, the raw data were transferred to the data collection center through a transmission technology (i.e., Wi-Fi, Bluetooth). The data pre-processing phase mainly involves filtering the raw data. The signal segmentation phase splits a large signal data stream into smaller fixed-size chunks (windows) via a sliding window technique, and human activities are related to each window.
In the feature extraction phase, informative features are extracted from each signal segment by additional processing to improve the generalization performance of the HAR system. These features are the number of zero crossings, RMS velocity, crest factor, root mean squared (RMS) value, kurtosis, skewness, peak-to-peak value, and signal entropy. In the MIL representation phase, the feature vectors, which were obtained for each sensor, are grouped to form a bag. Therefore, each instance contains a feature vector extracted from a particular sensor, and each bag includes the instances from all sensors. In the training phase, the bag collection is fed to two MIL algorithms (SimpleMI and MIWrapper) to build models using the individual contributions of various base learners, such as DT, NN, SVM, PART, KNN, RF, and AB. The evaluation phase uses the k-fold cross-validation method to assess how well the constructed models recognize human activities. The classification model that performs better with the test set is chosen for HAR. The classification step uses the HAR model to identify the physical human activities of individuals, such as walking, sitting, jogging, biking, standing, downstairs, and upstairs. Each step in the workflow involves several methods and corresponds to research questions to be addressed. Furthermore, the classifier must be updated periodically by following the same sequence to achieve high accuracy.

FIGURE 1 Overview of the proposed approach: Human activity recognition based on multi-instance learning (HAR-MIL).
In the proposed approach (HAR-MIL), each bag contains a set of instances, each of which belongs to an independent sensor. Let S = {S_1, S_2, …, S_t} be the set of sensors, where t is the number of sensor types. The training set D consists of a set of datasets D = {D_j}, j = 1, …, t; each one contains instances belonging to a different sensor type, such as A, G, M, and LA. Each dataset D_i includes a set of pairs of instances and their activity labels, denoted by D_i = {(x_i1, y_1), (x_i2, y_2), …, (x_in, y_n)} for i = 1, 2, …, t, and the output attribute Y = {y_1, y_2, …, y_n} has k human activities, where x_ij ∈ R^d and d is the dimension (also the number of extracted features).
We construct a MI dataset MD containing a set of bags and their activity labels, such that MD = {(B_j, A_j)}, j = 1, …, n, where A_j is the activity (class label) of the bag B_j. For instance, in a three-activity classification (standing, sitting, walking), the class labels of the bags are A_1 = standing, A_2 = sitting, and A_3 = walking. Each bag B_j consists of t instances, such that B_j = {x_1j, x_2j, …, x_tj}. A classifier Z is built to predict the class labels of unseen bags in the given test set T.
The pseudo-code of the proposed HAR-MIL method is presented in Algorithm 1. First, data preparation methods (filtering, segmentation, feature extraction) are applied to each sensor data separately in the algorithm. After that, n instances from t sensors are organized into bags. Each bag is labelled with a corresponding activity a. Here, a MI dataset MD is generated, which collects all instances from each different wearable sensor into a bag. The MI dataset is then used to train a classifier Z. In the final step, the classifier is used to predict the activity class label of each unseen bag in the test set T.
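The bag-construction step of this procedure can be sketched as follows (a hypothetical illustration: the function name, toy sensor values, and shapes are not from the paper; each window yields one instance per sensor, and the t instances of a window form one labelled bag):

```python
import numpy as np

def build_mi_dataset(sensor_datasets, labels):
    """Organize per-sensor feature vectors into labelled bags.

    sensor_datasets: list of t arrays, each of shape (n, d) -- one feature
                     vector per window for each of the t sensors.
    labels:          length-n list of activity labels, one per window.
    Returns a list of (bag, label) pairs, where each bag holds t instances.
    """
    t = len(sensor_datasets)
    n = len(labels)
    mi_dataset = []
    for j in range(n):
        # One instance per sensor for window j forms the bag.
        bag = [np.asarray(sensor_datasets[i][j]) for i in range(t)]
        mi_dataset.append((bag, labels[j]))
    return mi_dataset

# Toy example: 4 sensors (A, G, M, LA), 2 windows, 3 features each.
rng = np.random.default_rng(0)
sensors = [rng.normal(size=(2, 3)) for _ in range(4)]
md = build_mi_dataset(sensors, ["walking", "sitting"])
print(len(md), len(md[0][0]))  # 2 bags, 4 instances per bag
```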

| Data preparation
To prepare data for the HAR-MIL method, the following steps were implemented: data pre-processing, segmentation, and feature extraction.

| Data pre-processing
Algorithm 1 Human activity recognition based on multi-instance learning (HAR-MIL).
Inputs: D = {D_1, D_2, …, D_t}, a set of datasets; T, the test set.
Output: g, a set of outputs that are assigned to each bag in T.

The accelerometer measures the acceleration of an object along three directional axes: the x-axis (vertical axis) (A_x), the y-axis (horizontal axis) (A_y), and the z-axis (sideway axis) (A_z). In the pre-processing stage of this study, the signal magnitude vector of acceleration, A_mag, was calculated using these three axes' components as follows:

A_mag = sqrt(A_x^2 + A_y^2 + A_z^2)

In addition, a low-pass filter with a 0.6 Hz cut-off frequency f_c was applied to the raw sensor data to isolate the gravity component. After that, the gravity value was subtracted from the original data to find the body acceleration value.
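These pre-processing steps can be sketched as follows. The magnitude computation is the standard Euclidean norm of the three axes; the single-pole IIR low-pass filter is only an assumed stand-in, since the paper does not specify the filter design it used:

```python
import math
import numpy as np

FS = 50.0  # sampling rate (Hz), as used in this study
FC = 0.6   # low-pass cut-off frequency (Hz)

def magnitude(ax, ay, az):
    """Signal magnitude vector: A_mag = sqrt(Ax^2 + Ay^2 + Az^2)."""
    return np.sqrt(np.square(ax) + np.square(ay) + np.square(az))

def lowpass(signal, fc=FC, fs=FS):
    """Single-pole IIR low-pass filter that isolates the slowly varying
    gravity component (an assumed stand-in for the paper's filter)."""
    rc = 1.0 / (2 * math.pi * fc)
    dt = 1.0 / fs
    alpha = dt / (rc + dt)
    out = np.empty(len(signal), dtype=float)
    out[0] = signal[0]
    for i in range(1, len(signal)):
        out[i] = out[i - 1] + alpha * (signal[i] - out[i - 1])
    return out

# Body acceleration = raw signal minus the estimated gravity component.
raw = np.array([0.0, 0.1, 9.9, 9.7, 9.8, 9.8])
gravity = lowpass(raw)
body = raw - gravity
print(magnitude(3.0, 4.0, 0.0))  # 5.0
```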
An illustration of the raw sensor signals located in the left pocket of a participant is shown in Figure 2.

| Signal segmentation based on sliding window
In sensor-based HAR, signal segmentation is a method of splitting a large signal data stream into smaller chunks for processing, where each chunk corresponds to an activity. Signal segmentation is a crucial stage because the efficiency of feature extraction and the recognition accuracy directly depend on it. Generally, windowing approaches are used for signal segmentation. In the windowing approach, the sensor signals are segmented into fixed-size chunks, referred to as windows, and a single label is assigned to all the samples within the window. A window is defined as a set of adjacent sequences such that X = {x_r, x_r+1, …, x_r+w−1}, where w refers to the window size and r corresponds to an arbitrary position such that 1 ≤ r ≤ n − w + 1, where n is the data size. Furthermore, each window consists of a set of events, that is, S = {e_1, e_2, …, e_p}, where p is the number of events. In this study, we used the sliding window technique to segment the sensor data because it has been useful for the recognition of periodic signals (Shoaib et al., 2014).
A critical factor in the sliding window technique is selecting a suitable window size to achieve correct recognition, since the ideal window size varies according to the characteristics of the signals being evaluated. Usually, small window sizes are better for detecting faster-changing activities; however, using short windows may lead to misclassification because a signal is sliced into multiple windows, and some vital information about an activity may not be captured by a small window, especially in the case of semi-complex and complex activities (i.e., transportation modes). Alternatively, large windows are usually useful for the classification of semi-complex and complex activities (Akbari et al., 2018). However, a large window may contain signals that belong to more than one activity; because a single label is assigned to all samples within the window, this decreases classification accuracy. Based on this trade-off, users usually determine the optimal window size empirically, by experimenting with various fixed window sizes and evaluating classification accuracy. In this study, the length of each window was set at 1 s because the data were collected at a rate of 50 samples per second, and this is a suitable value to provide a reasonable compromise in HAR performance.
Segmentation based on the sliding window technique can be performed in either a non-overlapping or an overlapping way. Overlapping is expressed by a percentage that defines how many samples from the previous window are repeated in the current window, that is, X_1 ∩ X_2 ≠ ∅. On the other hand, non-overlapping indicates that the values in one window do not intersect with the values of another window, that is, X_1 ∩ X_2 = ∅. It is common practice to use overlapping windows to reduce information loss, particularly at the edges of the windows. Although different overlap values could be used, in this study we applied the sliding window technique with a 50% overlap between consecutive windows, based on Shoaib et al. (2014).
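The sliding-window segmentation described above, with a 50-sample window (1 s at 50 Hz) and 50% overlap, can be sketched as follows (a minimal illustration):

```python
def sliding_windows(signal, window_size=50, overlap=0.5):
    """Split a signal into fixed-size windows with fractional overlap.
    window_size=50 corresponds to 1 s at a 50 Hz sampling rate;
    overlap=0.5 repeats 50% of each window in the next one."""
    step = max(1, int(window_size * (1 - overlap)))
    windows = []
    start = 0
    while start + window_size <= len(signal):
        windows.append(signal[start:start + window_size])
        start += step
    return windows

data = list(range(200))       # 4 s of samples at 50 Hz
wins = sliding_windows(data)
print(len(wins), wins[1][0])  # 7 windows; window 2 starts at sample 25
```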

| Feature extraction
FIGURE 2 An illustration of raw HAR sensor signals located in the left pocket of a participant.

In the raw sensor data, a single value at a particular time instant does not carry sufficient in-depth information to describe a human activity. Feature extraction typically transforms the original sensor data into more elaborate and informative features for HAR. It is designed to capture a more exhaustive and useful representation of the sensor data, such as kurtosis and skewness, to correctly distinguish different human activities. Therefore, the choice of these features is critical, as is the selection of the classification algorithm, for generating a good classifier for HAR. Features are computed from each signal segment (window) in the frequency domain or the time domain. Frequency-domain features describe the periodical properties of a signal over the window, such as entropy, whereas time-domain features are statistical summaries of the data samples, such as RMS and peak-to-peak values.
Frequency-domain and time-domain features are usually extracted from one axis of a sensor, called single-axial features, such as the mean value of the x-axis of the accelerometer measurements. Furthermore, multi-axial features can be extracted as a combination of multiple axes of single or multiple sensors, such as the signal magnitude. In this study, both single-axial and multi-axial features, in both the frequency domain and the time domain, were individually extracted from the four sensor types (A, G, M, and LA) by considering the three axes (x, y, and z). Therefore, 112 features were extracted for each sensor type individually. The types of extracted features, including their descriptions and corresponding statistical formulas, are given in Table 3.
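A few of the time-domain features listed in Table 3 can be computed as sketched below (an illustrative subset of the 112 extracted features; the exact formulas and implementations used in the paper may differ):

```python
import numpy as np

def extract_features(window):
    """Compute a small subset of the time-domain features in Table 3
    for one signal window (an illustrative sketch, not the full set)."""
    w = np.asarray(window, dtype=float)
    rms = np.sqrt(np.mean(w ** 2))            # root mean squared (signal power)
    peak_to_peak = w.max() - w.min()          # maximum minus minimum of the window
    signs = (w > 0).astype(int)               # sign function: 1 for positive, 0 otherwise
    zero_crossings = int(np.sum(np.abs(np.diff(signs))))
    mu, sigma = w.mean(), w.std()
    skewness = np.mean(((w - mu) / sigma) ** 3) if sigma > 0 else 0.0
    return {"rms": rms, "p2p": peak_to_peak, "zc": zero_crossings, "skew": skewness}

feats = extract_features([1.0, -1.0, 1.0, -1.0])
print(feats["rms"], feats["p2p"], feats["zc"])  # 1.0 2.0 3
```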

| Base classifiers
This section describes the individual base classifiers used in the MIL task in this study for activity recognition, including DT, NN, SVM, PART, KNN, RF, and AB. We selected these algorithms because most of them appear among the top 10 data mining methods (Wu et al., 2008).
Decision tree (DT): DT classification algorithms use attribute values to divide decision space iteratively to predict unknown class labels. The structure of the DT comprises nodes (attributes), branches (attribute values), and leaves (class labels).
Neural network (NN): Inspired by the human nervous system, NNs are mathematical models that have been widely used in classification tasks.
A NN contains a set of neurons that work together with weight values to ensure that classification is performed successfully. A typical NN consists of an input layer, single or multiple hidden layers, and an output layer.

Support vector machine (SVM):
In this algorithm, classification is performed by defining borders among different instance groups. These borders are drawn at the maximum possible distance from the instance groups (the maximum margin). The algorithm calculates distances to define the borders, which can be linear or nonlinear.

Partial decision tree algorithm (PART): This classification algorithm combines the separate-and-conquer (rule-based) and divide-and-conquer (decision-tree-based) strategies. Nevertheless, it differs from traditional approaches in the way each rule is created: it builds a partial decision tree, converts the leaf covering the most instances into a decision rule, and then discards the tree before processing the remaining instances.

K-nearest neighbour (KNN):
In this algorithm, an unlabelled instance is assigned to a class, which has the most members when considering KNNs. The distances between the instance and its neighbours are calculated using various metrics such as Euclidean, Minkowski, and Manhattan distance measures.
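The Minkowski distance generalizes the other two metrics mentioned above (p = 1 gives Manhattan, p = 2 gives Euclidean); a minimal sketch, with illustrative points:

```python
import numpy as np

def minkowski(a, b, p):
    """Minkowski distance; p=1 is Manhattan, p=2 is Euclidean."""
    return float(np.sum(np.abs(np.asarray(a) - np.asarray(b)) ** p) ** (1.0 / p))

d_manhattan = minkowski([0, 0], [3, 4], p=1)  # 7.0
d_euclidean = minkowski([0, 0], [3, 4], p=2)  # 5.0
```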

Random forest (RF):
It is an ensemble ML method that includes multiple decision trees. A set of DTs built from diverse subsets of the dataset is combined to produce the final prediction.
T A B L E 3 Features extracted from sensor data.

Number of zero crossings: The number of times the signal crosses the zero level within each window. It is calculated using the sign function, which gives 0 for negative arguments and 1 for positive ones.
Peak-to-peak value: The difference between the maximum value and the minimum value of the window.
Root mean squared (RMS): The quadratic mean value of the sensor signal over the window. It is a measurement of signal power.
Skewness: The degree of asymmetry of the signal distribution of the samples within a window.
Kurtosis: The degree of peakedness of the signal distribution of the samples within a window.
Crest factor: The ratio of the maximum sensor signal value to the RMS value over the window. It is used to describe the impulsiveness of a signal.
Root mean squared (RMS) velocity: The quadratic mean of the speed of the signal over the window.
Signal entropy: An estimation of the amount of information carried by the signal.
AdaBoost (AB): It is an ensemble ML technique that combines multiple weak learners to form a strong learner. In each step, it improves the success of the model by reweighting instances that are difficult to classify. The outputs of the weak classifiers are combined using a weighted voting mechanism.

| EXPERIMENTAL STUDIES
The practicality of the proposed approach (HAR-MIL) was evaluated on real-world datasets using two MIL algorithms (SimpleMI and MIWrapper) combined with seven classification algorithms (DT, SVM, NN, PART, KNN, RF, and AB) using the machine learning tool (Witten et al., 2016). Except for AdaBoost and KNN, the input parameters of the classification algorithms were set to their default values. For AdaBoost, the C4.5 DT algorithm was chosen as the base learner because of its good generalization capability. For KNN, the value of k (the number of neighbours) was determined as k ≈ log2(n) based on a previous study (Dexun et al., 2013), where n is the number of bags in the MI dataset. The default value of k is 1; however, this is too small for an MI dataset in which large numbers of bags are available. We compared the alternative techniques in terms of classification accuracy, the percentage of testing cases predicted correctly by the classifier. Classification accuracy values were evaluated using ten-fold cross-validation, in which the data are randomly split into non-overlapping, equal-sized parts; one part is used for testing while the remaining parts are used to build the classifier.
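The k ≈ log2(n) heuristic and the ten-fold protocol can be sketched as follows. The synthetic data, the plain Euclidean majority-vote KNN, and the fold-splitting details are assumptions for illustration, not the evaluation tool's actual implementation:

```python
import math
import numpy as np

def knn_predict(X_train, y_train, X_test, k):
    """Majority vote among the k nearest neighbours (Euclidean distance)."""
    preds = []
    for x in X_test:
        d = np.linalg.norm(X_train - x, axis=1)
        nearest = y_train[np.argsort(d)[:k]]
        preds.append(np.bincount(nearest).argmax())
    return np.array(preds)

def ten_fold_accuracy(X, y, k, seed=0):
    """Ten-fold cross-validation over equal-sized, non-overlapping folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), 10)
    accs = []
    for i in range(10):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(10) if j != i])
        preds = knn_predict(X[train], y[train], X[test], k)
        accs.append((preds == y[test]).mean())
    return float(np.mean(accs))

n = 500                                 # number of bags (assumed for the example)
k = max(1, round(math.log2(n)))         # k ≈ log2(n)  →  9 for n = 500
rng = np.random.default_rng(1)
X = rng.normal(size=(n, 4))
y = (X[:, 0] > 0).astype(int)           # label driven by the first feature
acc = ten_fold_accuracy(X, y, k)
```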
We designed four experiments to investigate the effects of the following main factors (Table 4):
• Experiment 1-Impact of participant: We explored the activity classification performance of MIL-based HAR when both training and testing datasets come from a single participant (personalized classifier) or all participants (generalized classifier).
• Experiment 2-Impact of algorithms: We evaluated the performances of different MI classification algorithms with different base classifier combinations.
• Experiment 3-Impact of sensor position: We investigated the possible SP (left pocket, right pocket, belt, upper arm, and wrist) to determine whether changing SP improves the classification accuracy obtained by the proposed method (HAR-MIL).
• Experiment 4-Impact of sensor type: We measured the impact of the various combinations of four sensors (A, G, M, and LA) to determine the most effective combination that performs best for the MIL-based HAR.

| Dataset description
In this study, we used a large-scale and publicly available sensor activity recognition (SAR) dataset (Shoaib et al., 2014) to evaluate our method.
The dataset was collected for seven basic activities: walking, standing, jogging, sitting, biking, upstairs, and downstairs. The experiments were conducted in the university buildings, except for biking. The "walking" and "jogging" activities were performed in a department corridor, while the "upstairs" and "downstairs" activities were performed in a 5-floor building. Samples were obtained from four sensors (accelerometer (A), gyroscope (G), magnetometer (M), and linear acceleration (LA)) with a 50 Hz sampling rate using Samsung Galaxy SII (i9100) smartphones. An Android application was developed to gather the data from the sensors. Because the data were obtained from four different sensors, it was possible to work on the data fusion and sensor fusion concepts. The sensors were placed in the left and right pockets and on the belts, upper arms, and wrists of the individuals. The first three sensor locations are the most common ones because humans mostly carry their smartphones at one of these locations. The fourth position (upper arm) is frequently used when performing activities such as jogging. The fifth position (wrist) simulates a smartwatch or a smart wristband. For each activity, the data were gathered from all five body locations simultaneously. The dataset was collected from 10 participants, who were male and between the ages of 25 and 30. Each participant performed each activity for 3-4 min. Given 5 sensor positions and 10 participants, we used 50 datasets.
Each sensor obtains three-dimensional values: x, y, and z-axis. The accelerometer sensor measures gravitational acceleration in meters per second squared (m/s 2 ), while the gyroscope sensor provides the rotation rate for each of the three axes in radians per second (rad/s), and the magnetometer sensor reports the magnetic field in microtesla (μT) along each axis.
The SAR dataset is in the form of raw data, including the three axial values (X, Y, Z) measured by each sensor (A, G, M, and LA). In the data preparation stage of this study, the following three steps were performed: (i) data pre-processing: the signal magnitude vector was calculated using the three axis components, and a low-pass filter with a 0.6 Hz cut-off frequency was applied to isolate the gravity component and then obtain the body component; (ii) segmentation: the sliding window method with a 1 s window length and an overlap of 50% was used to segment the sensor data; and (iii) feature extraction: both single-axial and multi-axial, as well as frequency-domain and time-domain, features were separately extracted from the data of the four sensors (A, G, M, and LA), including the number of zero crossings, RMS velocity, crest factor, RMS value, kurtosis, skewness, peak-to-peak value, and signal entropy. Thus, 112 features were obtained for each sensor type.
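Under stated assumptions (a first-order IIR filter standing in for the paper's unspecified low-pass design, and random numbers in place of real sensor readings), the three preparation steps can be sketched as:

```python
import numpy as np

FS = 50          # sampling rate (Hz)
WIN = FS         # 1 s window  → 50 samples
STEP = WIN // 2  # 50% overlap → 25-sample step
CUTOFF = 0.6     # low-pass cut-off (Hz), isolates the gravity component

def magnitude(xyz):
    """Signal magnitude vector from the three axis components."""
    return np.sqrt((xyz ** 2).sum(axis=1))

def low_pass(signal, fs=FS, cutoff=CUTOFF):
    """First-order IIR low-pass (a simple stand-in for the paper's filter)."""
    rc = 1.0 / (2 * np.pi * cutoff)
    alpha = (1.0 / fs) / (rc + 1.0 / fs)
    out = np.empty_like(signal)
    out[0] = signal[0]
    for i in range(1, len(signal)):
        out[i] = out[i - 1] + alpha * (signal[i] - out[i - 1])
    return out

def segment(signal, win=WIN, step=STEP):
    """Sliding windows of 1 s with 50% overlap."""
    return [signal[s:s + win] for s in range(0, len(signal) - win + 1, step)]

xyz = np.random.default_rng(0).normal(size=(500, 3))  # 10 s of fake 3-axis data
mag = magnitude(xyz)
gravity = low_pass(mag)       # slow (gravity) component
body = mag - gravity          # body component = raw magnitude minus gravity
windows = segment(body)       # feature extraction would run on each window
```

With 10 s of data, the 50% overlap yields 19 one-second windows rather than 10.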
To convert single-instance data to MI data, we implemented a bag-of-sensors generator. The generator represents each action as a bag of four instances, where each instance is a feature vector extracted from the data of a different sensor obtained over a particular period. Thus, the data of each sensor are represented by a feature vector and referred to as an instance, and the data of multiple sensors are considered a bag of instances.
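A minimal sketch of such a bag-of-sensors generator might look as follows; the dictionary-based bag structure and the make_bag helper are hypothetical, introduced only for illustration:

```python
import numpy as np

SENSORS = ("A", "G", "M", "LA")  # accelerometer, gyroscope, magnetometer, linear acceleration

def make_bag(feature_vectors, label):
    """One activity = a bag of four instances, one feature vector per sensor."""
    assert set(feature_vectors) == set(SENSORS)
    instances = [np.asarray(feature_vectors[s]) for s in SENSORS]
    return {"instances": instances, "label": label}

rng = np.random.default_rng(0)
# Hypothetical 112-dimensional feature vectors, one per sensor, for one segment
vectors = {s: rng.normal(size=112) for s in SENSORS}
bag = make_bag(vectors, label="walking")
```

A single class label ("walking") is attached to the whole bag, never to the individual sensor instances, which is exactly what makes the representation multi-instance.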

| Experiment 1: Impact of participant
In this experiment, the proposed HAR-MIL method was evaluated in terms of the participant factor. We compared the recognition performance from two different viewpoints: the sensor data come from a single participant (personalized classifier) or from all participants (generalized classifier). In the first case, each participant has his/her own classification model, implying that the dataset of each participant was used to build a classification model. In the second case, all participants' data are combined to construct a general classifier.
In the personalized version, the classification models are trained for a specific participant with his or her own data. Participant-based accuracy results of the MIL algorithms with different base learners and sensor locations are shown in Figure 3. These results show that it is possible to achieve high and stable recognition accuracies using the proposed HAR-MIL method for both personalized and generalized cases. The average accuracy value of each participant is different because of the differences in human behaviour. According to the results, the personal accuracies of the participants were between 95.82% and 99.58% for SimpleMI, and between 96.34% and 99.65% for MIWrapper.
However, the overall recognition accuracy (94.53% and 93.92%) for all participants was slightly lower than that for individual ones. The results show that combining all participants' data as a single dataset decreased the classification accuracy by approximately 3% on average.

| Experiment 2: Impact of algorithms
In the second experiment, we evaluated the performances of two MI classification algorithms with various base classifier combinations. The classification accuracies obtained using the HAR-MIL approach depend on the classification methods being used (Figure 4). According to the results, the best accuracy for SimpleMI was obtained using the NN algorithm (99.04%), whereas for MIWrapper it was achieved using the RF algorithm (99.13%). The SimpleMI + SVM combination also achieved very promising results. The MIWrapper + AB combination demonstrated above 98% accuracy for all sensor positions, which can be attributed to AdaBoost being an ensemble technique that combines multiple weak learners into a strong learner.
The results indicate that MIWrapper has slightly higher classification accuracies compared to SimpleMI.
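For reference, the core transforms behind the two algorithms can be sketched as follows. This is a simplified reading of SimpleMI's bag averaging and MIWrapper's label propagation, not the original implementations:

```python
import numpy as np

def simple_mi_transform(bags):
    """SimpleMI: collapse each bag into a single vector (mean of its instances),
    so any standard single-instance classifier can be applied to bags."""
    return np.array([np.mean(bag, axis=0) for bag in bags])

def mi_wrapper_transform(bags, labels):
    """MIWrapper (training side): copy the bag label to every instance and
    train a single-instance classifier on the flattened instances."""
    X = np.vstack(bags)
    y = np.concatenate([[lab] * len(bag) for lab, bag in zip(labels, bags)])
    return X, y

def mi_wrapper_predict(bag, instance_proba):
    """MIWrapper (test side): average per-instance class probabilities."""
    return np.mean([instance_proba(x) for x in bag], axis=0)

bags = [np.ones((4, 3)), np.zeros((4, 3))]   # two bags of four 3-D instances
X_simple = simple_mi_transform(bags)         # one row per bag
X_inst, y_inst = mi_wrapper_transform(bags, ["walking", "sitting"])
```

The difference in where aggregation happens (before training for SimpleMI, after prediction for MIWrapper) is consistent with MIWrapper's higher cost and slightly higher accuracy reported above.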
The performance of the proposed method was compared with a powerful classification method, the extreme learning machine (ELM), which has been used for HAR in recent studies (Patil et al., 2019). Parameter tuning for ELM was performed by considering alternative activation functions (sigmoid, sinusoidal, ReLU) and numbers of hidden neurons (10, 25, 50, 100, 250, 500, 1000). The best result was obtained when using 500 hidden neurons with the sigmoid activation function. Comparing the results of ELM (94.57%) and HAR-MIL (99.32%), the proposed approach outperformed ELM on the SAR dataset. The confusion matrix obtained using ELM is shown in Figure 5. The stable activities (standing and sitting) were perfectly classified using ELM, but it was difficult to distinguish the upstairs and downstairs activities.

| Experiment 3: Impact of sensor position
Wearable sensors can be placed on various body parts, such as the waist, arm, or wrist. Determining the ideal body location for a sensor is a research-worthy problem in HAR because the sensor location considerably affects activity recognition accuracy. The sensor location is crucial because activity dynamics may vary depending on the sensor position. For example, placing the sensor on the thigh contributes to high accuracy for the jogging activity; however, the same position is probably not useful for hand-oriented activities, such as clapping or writing. For instance, a sensor placed on a thigh captures several motions when the participant is walking, resulting in a significant fluctuation of the signal; the signal includes a sequence of valleys and peaks because the leg forms a series of downward and upward movements when walking. However, when the sensor is placed on the chest, the signal reflects the movement of the torso. Therefore, the role of SP should be investigated in detail.
F I G U R E 3 Comparison of personalized (participant-based) and generalized (all-participants-based) classifiers in terms of average classification accuracy.
In this experiment, we evaluated the performance of the HAR-MIL method for five different SP on the body: the left pocket, right pocket, belt, upper arm, and wrist. The first three positions are frequently used by humans when carrying smartphones. When the smartphone is in a trouser pocket, the signal tends to reflect the movement of the leg. The fourth position is commonly used when activities such as biking are performed. The fifth position simulates a smartwatch or a smart wristband. Therefore, placing smartphones in one of these five positions does not affect the physical activities of the individual in any way. Smartphones were used to collect data for all five body locations simultaneously for each human activity. The phones were held in a portrait orientation for the two pockets, wrist, and upper arm, and in a landscape orientation for the belt position. Note that the locations of these phones on the users' bodies were fixed. After performing an activity, the five smartphones were removed from the participant's body and the Android applications were stopped. This caused some abnormal signals (noise) at the start and end of each activity recording; these noisy parts of the data were removed before analysis. It is also possible to combine multiple sensors from different body positions; however, this could cause complex sensor placement, higher cost, deployment difficulty, computational complexity, and obtrusiveness for individuals. In addition, sensors spread over the body hinder the user from performing daily living activities, which may result in the individual refusing to wear them. As a result, placing sensors at a single position is more appropriate for the recognition of basic human activities, such as walking, upstairs, and downstairs. Therefore, in this study, we used the sensor data obtained from a single body position each time when training and testing classifiers. Figure 6 shows the comparative results for different sensor positions (SP).
We plotted the accuracies on both dimensions to investigate the relative performance of different classification algorithms for different sensor locations. The results showed that the proposed HAR-MIL approach does not give the same accuracy rate for different SP. Therefore, the sensor position on the body is a significant issue to consider for capturing body movements correctly. According to the results, the highest average accuracy for both SimpleMI and MIWrapper was achieved using the belt as the smartphone position (97.72% and 98.29%, respectively). Thus, the most discriminant sensor location determined using the HAR-MIL method is the belt. The SimpleMI + NN combination achieved very high accuracy values of more than 98% at all positions. Similarly, the MIWrapper + RF combination is the most effective method for recognizing various types of daily activities because it achieved the best average accuracy (98.76%) when considering all SP. The results showed that MIWrapper has slightly higher classification accuracies than SimpleMI. In addition, note that the recognition performance trends are similar for the left and right pocket locations.
They can be used alternatively when recognizing daily human activities because they perform almost equally in many situations. All HAR-MIL algorithms achieved good performance for all sensor locations, with accuracies above 93.94%.

| Experiment 4: Impact of sensor type
HAR systems typically use only one sensor type, usually an accelerometer. However, a single sensor type cannot deal with semi-complex or complex human activities in practice. This is why we investigated multi-sensor HAR applications. Different sensor types provide diverse and abundant information and varied perspectives on specific human activities. For example, an accelerometer can provide information about body translation, and a gyroscope can measure the orientation of the motion; thus, combining the data obtained from these sensors is usually required to recognize complex human activities correctly. The combination of multiple sensor types can provide useful information about human activities, thereby improving the classification performance of a HAR system. However, using multiple sensors could increase the complexity and cost of the system compared with a single sensor type. Therefore, instead of using all available sensors, an ideal sensor combination should be selected to minimize data transmission overheads on a limited network.
In this experiment, we evaluated the performance of the HAR-MIL method for different sensor configurations: A + G + M + LA, A + G + M, A + G, A + M, and G + M. This analysis was used to determine how many, and which, motion sensors should be combined for better recognition performance. We reduced the cost, complexity, and data transmission load by evaluating the performances of various sensor combinations that produce results comparable to the use of A + G + M + LA. Figure 7 compares different sensor combinations when the RF algorithm is used with the MIL algorithms. We used the RF algorithm because of its superiority in the previous experiments. We plotted the accuracies on both dimensions to investigate the relative performance of different sensor combinations for different sensor locations. According to the results, the best sensor combination on average is A + G for SimpleMI (97.16%), whereas it is A + G + M for MIWrapper (98.09%). The results showed that the accelerometer and gyroscope play the lead roles when recognizing multi-instance-based human activities. Among the sensor combinations, A + G attracts the most attention because of its low cost, flexibility in daily activities, and satisfactory recognition performance. The highest accuracy (98.65%) was achieved by MIWrapper using the belt as the smartphone position. The results showed that MIWrapper has slightly higher classification accuracies than SimpleMI.

| Human activity evaluation
A confusion matrix is a common method for visualizing the performance of a classifier for each class. The rows and columns in the matrix represent the actual and predicted classes, respectively. An element (M_ij) in a confusion matrix M is the number of instances that belong to class i but are recognized as class j. The nondiagonal elements represent misclassified instances, whereas the diagonal elements (M_ii) represent the number of instances correctly classified for each class. Figure 8 shows the confusion matrices for the multi-instance-based HAR task for each sensor location.
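The definition above translates directly into code; the toy activity labels are illustrative only:

```python
import numpy as np

def confusion_matrix(actual, predicted, classes):
    """M[i][j] = number of instances of class i recognized as class j."""
    index = {c: i for i, c in enumerate(classes)}
    M = np.zeros((len(classes), len(classes)), dtype=int)
    for a, p in zip(actual, predicted):
        M[index[a], index[p]] += 1
    return M

classes = ["walking", "upstairs", "downstairs"]
actual    = ["walking", "walking", "upstairs", "downstairs", "upstairs"]
predicted = ["walking", "upstairs", "upstairs", "downstairs", "downstairs"]
M = confusion_matrix(actual, predicted, classes)
accuracy = np.trace(M) / M.sum()   # diagonal = correctly classified instances
```

Here three of five instances lie on the diagonal, so the accuracy is 0.6; the off-diagonal cells show which activity pairs are confused.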
The results were given for the NN algorithm and the fourth participant because high results were obtained with them in the previous experiments. According to the results, the classifiers were better at recognizing the standing, sitting, and biking activities. Furthermore, the generated classifiers can correctly distinguish very similar activities, such as downstairs and upstairs. The confusion matrices showed that the most confusing actions are walking and upstairs when using SimpleMI, whereas they are downstairs and jogging for MIWrapper. In terms of sensor locations, the belt position mostly takes the leading role in recognizing human activities.
The results show that all classifiers provide a certain degree of accuracy. Because both SimpleMI + NN and MIWrapper + NN demonstrated above 98% accuracy for all SP, they can be used alternatively when recognizing human activities of daily living. These results showed that the activities of daily living can be distinguished very well using HAR-MIL.
To show the superiority of our method, we compared it with the state-of-the-art methods in the literature. The related studies that used the SAR dataset (Shoaib et al., 2014), along with the techniques and the corresponding classification accuracies, are shown in Table 6. The bold value denotes the highest model accuracy. The proposed HAR-MIL approach outperformed the other methods with a 10% improvement on average (Table 6). Employing HAR-MIL (SimpleMI + NN) for sensors located at the belt position achieved better accuracy (99.32%) than the state-of-the-art models on the same dataset.

| DISCUSSION
In this study, the performance of multi-instance-learning-based human activity recognition (HAR-MIL) was investigated on daily activities (walking, standing, jogging, sitting, biking, upstairs, and downstairs). The main results of this research are as follows:
• The highest accuracy value, 99.32%, was obtained by SimpleMI + NN for the belt location. Therefore, the proposed HAR-MIL method outperformed the state-of-the-art methods by 10% on average on the same dataset.
• The overall average accuracy was approximately 97% when considering all participants, sensor locations, and classification algorithms.
• Participant-based (personalized) training achieves higher classification accuracy than all-participant-based (generalized) training because of the differences in human behaviour. Combining all participants' data as a single dataset decreased the classification accuracy by approximately 3% on average.
• The best classification results were usually obtained by MIWrapper; however, SimpleMI classified nearly as accurately as MIWrapper. Although MIWrapper usually gives approximately 1% higher accuracy than SimpleMI, its execution time is approximately ten times longer than that of SimpleMI.
• The highest average accuracy results considering all sensor locations were obtained by the combinations of SimpleMI + NN (98.49%) and MIWrapper + RF (98.76%). Thus, using the NN algorithm or the RF algorithm can provide better performance among the seven base learners.
• When the sensor position is considered, the belt (97.72%), left pocket (97.23%), right pocket (96.88%), wrist (96.39%), and upper arm (96.33%) were the most effective to least effective locations for SimpleMI on average. MIWrapper has the same average ranking as SimpleMI, which is the belt (98.29%), left pocket (98.14%), right pocket (97.76%), upper arm (97.24%), and wrist (97.06%). The belt position yielded the highest accuracy, followed by the left pocket position. Therefore, the belt position can be successfully considered as the best sensor location when working on MI HAR data to recognize various types of human activities. However, different HAR systems could place sensors on different parts of the human body targeting specific aims, activities, and applications, instead of a belt.
• On average, the best sensor combination for SimpleMI (97.16%) is the accelerometer and gyroscope (A + G), whereas that for MIWrapper (98.09%) is the accelerometer, gyroscope, and magnetometer (A + G + M). These results show that the accelerometer and gyroscope play lead roles when recognizing multi-instance-based human activities. Therefore, instead of using all available sensors, the accelerometer and gyroscope can be used together to minimize cost, complex sensor placement, data transmission load, computational complexity, and obtrusiveness for individuals. In addition, finding the optimal HAR sensor combination may decrease the cost of the design, management, and maintenance of wearable devices.
• When the type of human activity is considered, the classifiers are better at recognizing the standing, sitting, and biking activities.
Furthermore, the generated classifiers can correctly distinguish very similar activities, such as downstairs and upstairs. In terms of sensor locations, the belt position mostly takes the leading role in recognizing human activities.
These results demonstrated that the proposed HAR-MIL approach can appropriately classify activities of daily living because it usually provides very high accuracy values. The results obtained in the four experiments in this study showed the feasibility of our method in terms of cost and applicability.
In this study, we demonstrated the results of MIL-based HAR for some specific scenarios, such as algorithm combinations, sensor positions, and sensor combinations. However, these results may differ if different MIL algorithms, different sensor positions, or feature sets are used. Furthermore, we left the input parameters of the classification algorithms as their default values, except for KNN and AdaBoost. Changing the default parameters may lead to different classification results. Moreover, in this study, the orientation of the smartphone is fixed, and changing the orientation may also lead to different results.
Using MIL for HAR has several advantages. Because multiple sensors are strongly related to each other, it can be useful to unify their data in a MI manner to reach a final consensus. In this way, multi-sensor data formed as MI bags can significantly improve recognition. In addition, using MIL may reduce the negative effects on classification accuracy caused by sensors that are out of order for some reason. The proposed method can be used in many fields and for many applications, such as exercise tracking, physiotherapy and rehabilitation studies, entertainment, gait analysis, and fall detection.

| CONCLUSIONS AND FUTURE WORK
HAR is challenging because of the numerous ways in which an activity can be represented. To accurately identify activities from the data collected from different sources, it is crucial to represent the activity properly. In addition, it is essential to minimize bias due to missing data, faulty sensors, or measurement errors. This study proposes an approach, "HAR-MIL", which combines the HAR and MIL concepts to overcome such issues and eliminate the limitations of the traditional single-instance representation by introducing a multi-instance representation. Unlike traditional HAR approaches, the proposed method can handle multiple sensor data simultaneously, with instances obtained from several sensors appearing together in a bag to define an activity. For the first time, two multi-instance algorithms (SimpleMI and MIWrapper) with different base learners (DT, SVM, NN, PART, KNN, RF, and AB) were used in the field of wearable sensor-based HAR. This is also the first study to explore the effects of sensor positions (left pocket, right pocket, belt, upper arm, and wrist), sensor combinations (A + G + M + LA, A + G + M, A + G, A + M, and G + M), and the number of participants (single or multiple) on the MI representation of the relevant activity. The performance of the proposed approach was investigated on daily human activities, such as upstairs, downstairs, walking, standing, jogging, sitting, and biking.
In the experimental studies, the effectiveness of the proposed HAR-MIL method was demonstrated on 50 participant-based and sensor-position-based activity recognition datasets. The results showed that the proposed method is effective for wearable sensor-based HAR with high classification accuracy (99.32%). Furthermore, the results showed that our method outperformed the state-of-the-art methods by 10% on average on the same dataset.
In the future, the performance of MIL-based HAR can be investigated on a dataset that includes specific human activities, such as transportation, sports, military, and hand-oriented activities. Moreover, the impact of smartphone orientation (i.e., landscape or portrait) on the human body can be explored in a MI manner. Like many of the existing HAR approaches, the proposed method is position-dependent; that is, it requires a phone to be placed at a fixed location on the body. In daily living activities, people do not always keep their phones in a fixed location. Therefore, a MIL-based HAR model can be trained on sensor datasets of the same activity collected from different phone locations at the same time.
In future works, the HAR-MIL approach can be investigated by considering synchronized multi-activity recognition, activity transition recognition, and age-specific (i.e., young, adult, senior) HAR. Furthermore, the concept of data fusion can be combined with the HAR-MIL approach by taking advantage of its positive aspects.