Validation and Comparison of Accelerometers Worn on the Hip, Thigh, and Wrists for Measuring Physical Activity and Sedentary Behavior

Background Recent evidence suggests that physical activity (PA) and sedentary behavior (SB) exert independent effects on health. Therefore, measurement methods that can accurately assess both constructs are needed. Objective To compare the accuracy of accelerometers placed on the hip, thigh, and wrists, coupled with machine learning models, for measurement of PA intensity category (SB, light-intensity PA [LPA], and moderate- to vigorous-intensity PA [MVPA]) and breaks in SB. Methods Forty young adults (21 female; age 22.0 ± 4.2 years) participated in a 90-minute semi-structured protocol, performing 13 activities (three sedentary, 10 non-sedentary) for 3–10 minutes each. Participants chose activity order, duration, and intensity. Direct observation (DO) was used as a criterion measure of PA intensity category, and transitions from SB to a non-sedentary activity were breaks in SB. Participants wore four accelerometers (right hip, right thigh, and both wrists), and a machine learning model was created for each accelerometer to predict PA intensity category. Sensitivity and specificity for PA intensity category classification were calculated and compared across accelerometers using repeated measures analysis of variance, and the number of breaks in SB was compared using repeated measures analysis of variance. Results Sensitivity and specificity values for the thigh-worn accelerometer were higher than for wrist- or hip-worn accelerometers, > 99% for all PA intensity categories. Sensitivity and specificity for the hip-worn accelerometer were 87–95% and 93–97%. The left wrist-worn accelerometer had sensitivities and specificities of > 97% for SB and LPA and 91–95% for MVPA, whereas the right wrist-worn accelerometer had sensitivities and specificities of 93–99% for SB and LPA but 67–84% for MVPA. The thigh-worn accelerometer had high accuracy for breaks in SB; all other accelerometers overestimated breaks in SB. 
Conclusion Coupled with machine learning modeling, the thigh-worn accelerometer should be considered when objectively assessing PA and SB.

than sitting/lying [19,20]. Furthermore, the MVPA threshold for accelerometer counts/min is considerably different for ambulatory activities compared to non-ambulatory activities, rendering any single cut-point inaccurate for assessing MVPA [21]. An accurate measurement tool for SB needs to differentiate between SB and standing or other LPA; the measurement tool must also be able to classify intensity of both ambulatory and non-ambulatory activities for MVPA.
Due to limitations of the cut-point approach to measuring PA intensity categories, researchers have utilized machine learning models to improve accuracy of PA measurement. Studies show improved measurement of energy expenditure, accurate classification of activity type, and correct classification of PA intensity category from a hip-mounted accelerometer [22-24]. Despite the common use of hip-mounted accelerometers, wearing accelerometers on other body locations offers advantages such as improved comfort and compliance. Research shows that machine learning modeling has dramatically improved measurement accuracy of accelerometers worn on various body locations, such as the wrist and thigh [23,25]. Additionally, accelerometers worn on the wrist and thigh have shown strong ability to detect specific activities and have yielded acceptably accurate assessments of energy expenditure and SB [23,26,27]. However, these accelerometer placements have not yet been tested for assessment of PA intensity category when coupled with machine learning models. Therefore, the purpose of our study was to develop, validate, and compare the accuracy of hip-, thigh-, and wrist-worn accelerometers, coupled with machine learning models, for measuring 1) total time spent in SB, LPA, and MVPA intensity categories, and 2) breaks in SB in a semi-structured setting.

Study participants
Study participants (n = 44, 50% female) were recruited from the surrounding area of East Lansing, MI. Eligible participants were able to perform MVPA safely, did not have orthopedic limitations, were 18-44 years old and could read and speak the English language. Written, voluntary informed consent was obtained from all participants, and this study was approved by the Michigan State University Institutional Review Board.

Equipment
Each participant wore four accelerometers in this study. Two ActiGraph GT3X+ accelerometers (ActiGraph LLC, Pensacola, FL) were worn, one on the midline of the right thigh (adhered with hypoallergenic tape), and one above the right hip at the anterior axillary line (secured on elastic hip belt). Participants also wore two GENEActiv accelerometers (Activinsights Ltd, Kimbolton, Cambridgeshire, UK), one on the dorsal side of each wrist using a manufacturer-supplied watch strap. A sampling frequency of 20 Hz was chosen for the GENEA accelerometers for comparison to another accelerometer tested (not used in this analysis). ActiGraph accelerometers have a minimum sampling frequency of 30 Hz; therefore, 40 Hz was chosen for sampling frequency, which was reintegrated to 20 Hz after downloading data. A portable metabolic analyzer (Oxycon Mobile; CareFusion, San Diego, CA) was worn by participants during the study, but these data were not utilized for the current analysis.

Procedure
Upon each participant's arrival at the Human Energy Research Laboratory, weight and height were measured (to the nearest 0.1 kg and 0.1 cm, respectively) according to standardized methods [28]. Handedness was determined by asking participants which hand they preferred to use for the majority of everyday activities, and participant age was determined from self-reported date of birth.
After being fitted with the four accelerometers, participants performed 13 activities of varying type and intensity, chosen to represent activities that individuals may perform in a free-living environment (Table 1). Participants performed the activities for a total of 90 minutes. They performed each activity for 3-10 minutes, in the order and for the exact duration of their choosing. Ambulatory activities (walking and jogging) are common in accelerometer validation literature; we included these but added sedentary, exercise, and lifestyle activities to determine the potential for the four accelerometer placements to measure different PA intensity categories accurately in a semi-structured, simulated free-living setting. The 13 activities were described to each participant prior to the start of the protocol, and some of the less familiar activities (e.g., squats) were demonstrated to ensure understanding. Additionally, the research assistant performing direct observation (DO) updated participants periodically on which activities they still needed to complete.

Creation of models to predict PA intensity category
From the raw accelerometer data, percentiles (10th, 25th, 50th, 75th, and 90th) were extracted for each accelerometer axis for each 30-second window of data, and the extracted features were used as inputs for machine learning model development. No filtering of the raw accelerometer data was conducted prior to feature extraction. The 30-second window length was chosen partially to time-match with portable metabolic analyzer data; additionally, 30-second windows were chosen rather than 60-second windows due to the non-steady-state nature of the protocol. Artificial neural networks (ANNs), a popularly tested modeling technique for predicting activity type and energy expenditure from accelerometer data, were developed for this study [24,29,30]. The ANNs were created to categorically classify all activity into one of three intensity categories: SB, LPA, or MVPA. The decision not to further differentiate MVPA into moderate- or vigorous-intensity PA was made to maximize accuracy of the ANN models, which have lower classification accuracy as more classification categories are added [31-33]; additionally, for researchers and practitioners interested in assessing health-enhancing PA rather than specific intensities, further differentiation of higher-intensity activities may not always be necessary. A graphical depiction of the ANNs created in the current study is shown in Figure 1. The ANN function outputs a probability between 0 and 1 for each PA intensity category. Values closer to one represent a higher likelihood of that intensity category, and the intensity category with the value closest to one was chosen as the predicted output by the ANN. Breaks in SB from the accelerometers were defined as a 30-second window of LPA or MVPA that followed time classified as SB. The ANNs created in this study contained one hidden layer with five hidden units. Skip-layer connections were not allowed in the ANN.
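The feature extraction, category selection, and break-counting rules described above can be illustrated with a minimal Python sketch. It is not the study's code. The function names, the in-memory list format for the three axes, and the linear-interpolation convention for percentiles are all assumptions made for illustration; the 30-second window, the 20 Hz rate, and the five percentiles come from the text.

```python
from math import floor

WINDOW_S = 30                      # window length stated in the text
SAMPLE_HZ = 20                     # sampling rate after reintegration, per the text
PERCENTILES = (10, 25, 50, 75, 90)


def percentile(sorted_vals, p):
    """Percentile by linear interpolation between ranks (an assumed convention)."""
    rank = (len(sorted_vals) - 1) * p / 100.0
    lo = floor(rank)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (rank - lo)


def window_features(x, y, z, window_len=WINDOW_S * SAMPLE_HZ):
    """Return one 15-value feature row (5 percentiles x 3 axes) per complete window."""
    n_windows = len(x) // window_len   # trailing partial window is dropped
    features = []
    for w in range(n_windows):
        start, stop = w * window_len, (w + 1) * window_len
        row = []
        for axis in (x, y, z):
            vals = sorted(axis[start:stop])
            row.extend(percentile(vals, p) for p in PERCENTILES)
        features.append(row)
    return features


def predicted_category(probabilities):
    """Pick the category whose output is closest to one, i.e. the largest value."""
    return max(probabilities, key=probabilities.get)


def count_breaks(categories):
    """Count windows of LPA or MVPA that immediately follow a window of SB."""
    return sum(
        1
        for prev, cur in zip(categories, categories[1:])
        if prev == "SB" and cur in ("LPA", "MVPA")
    )
```

Each row returned by `window_features` would serve as the input vector for one 30-second window, and `count_breaks` would be applied to the per-window predictions of a single participant in time order.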
The ANNs were created and tested using a leave-one-out cross-validation. In this approach, data from all but one participant were used to estimate the weights for each input feature for predicting PA intensity category. Then, the ANN was tested on the data from the participant left out of the training phase by supplying the input features and comparing the predicted PA intensity category from the ANNs to the recorded PA intensity category from DO. The leave-one-out cross-validation is an iterative approach and was repeated with each participant's data used as the testing data once. This process was conducted separately for each accelerometer, resulting in four distinct ANNs. The ANNs developed can be found at the following link: https://drive.google.com/open?id=0B-BgdTzyd2OxazA1UE1zcTFEY1k.
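The leave-one-out procedure described above holds out all windows from one participant at a time. The sketch below shows that splitting logic in plain Python. The function names are hypothetical, and `fit_majority_class` is only a stand-in so the example runs on its own; the study trained an ANN at this step.

```python
from collections import Counter


def leave_one_participant_out(participant_ids):
    """Yield (held_out_id, train_indices, test_indices) once per participant."""
    for held_out in sorted(set(participant_ids)):
        train = [i for i, pid in enumerate(participant_ids) if pid != held_out]
        test = [i for i, pid in enumerate(participant_ids) if pid == held_out]
        yield held_out, train, test


def fit_majority_class(features, labels):
    """Placeholder model (NOT an ANN): always predicts the most common label."""
    most_common = Counter(labels).most_common(1)[0][0]
    return lambda feature_rows: [most_common for _ in feature_rows]


def cross_validate(features, labels, participant_ids, fit=fit_majority_class):
    """Return a held-out prediction for every window, in the original order."""
    predictions = [None] * len(labels)
    for _, train, test in leave_one_participant_out(participant_ids):
        model = fit([features[i] for i in train], [labels[i] for i in train])
        held_out_preds = model([features[i] for i in test])
        for i, pred in zip(test, held_out_preds):
            predictions[i] = pred
    return predictions
```

Because every window is predicted exactly once by a model that never saw that participant's data, the returned predictions can be compared directly against the DO labels. Running the procedure once per accelerometer placement would give one set of held-out predictions per placement.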

Direct observation
DO served as the criterion measure for time spent in SB, LPA, MVPA, and the number of breaks in SB for the current study. Activities performed were recorded continuously and in real time on a handheld personal digital assistant using the BEST (Observerware, Hobe Sound, FL) observation software. Using this software, activities performed during the visit were coded as 1 of the 13 activity types listed in Table 1. Research assistants were trained to record an activity change as closely as possible to the moment it occurred. After collection, DO data were synchronized with the accelerometer data so that each 30-second window of accelerometer data was matched to the actual activity performed during that window. In most cases, only one activity occurred during a given 30-second window. However, when transitioning between activities, two activities could occur in the same window. If this occurred, the window was automatically recoded as a transition. Additionally, the transition category was used to define all time between activities, such as resting or making an equipment adjustment between activities. Thus, transitions did not represent a specific activity type but instead involved walking, standing, etc. that occurred at the end of one activity and before the next started. All data coded as a transition were included when determining breaks in SB but removed from the DO and accelerometer datasets prior to creation and testing of the ANNs for the determination of time spent in each PA intensity category. From the DO data, activities were classified into one of three intensity categories (SB, LPA, or MVPA) based on MET values published in the Compendium of Physical Activities [34]. For DO, transitions from SB to a non-sedentary activity were summed to obtain total breaks in SB during the protocol. Five research assistants collected DO data during the study. 
Pilot testing demonstrated inter-researcher reliability of r > 0.92 across all research assistants for assessment of total time in each activity type.
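The window-matching rule described above (a window containing more than one activity, or any time between activities, becomes a transition) can be sketched as follows. The record format `(start_s, end_s, activity)` and the function name are assumptions for illustration; the export format of the observation software is not described in the text.

```python
def label_windows(observations, total_s, window_s=30):
    """Assign one label per window from (start_s, end_s, activity) records.

    A window receives an activity label only when that single activity covers
    the whole window. Windows that span two activities, or that include time
    between activities, are labelled "transition".
    """
    labels = []
    for start in range(0, total_s - window_s + 1, window_s):
        stop = start + window_s
        label = "transition"
        for obs_start, obs_end, activity in observations:
            if obs_start <= start and stop <= obs_end:
                label = activity
                break
        labels.append(label)
    return labels


# Example: reading for 70 s, a 25 s gap, then walking until 180 s.
example = label_windows([(0, 70, "reading"), (95, 180, "walking")], total_s=180)
```

In this example the third and fourth windows (60-90 s and 90-120 s) are labelled as transitions because neither is fully covered by a single activity. Per the text, such windows would be kept when counting breaks in SB but removed before training and testing the ANNs.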

Statistical analyses
Sensitivity and specificity were calculated for SB, LPA, and MVPA for each accelerometer and each participant. Sensitivity was calculated by assessing the proportion of instances of a certain intensity category correctly classified as that category by the accelerometer (e.g., the proportion of instances where the participant was performing SB that the thigh-worn accelerometer correctly classified the activity as SB). Specificity was calculated as the proportion of instances where an intensity category was not performed and was correctly classified as not performing that intensity category (e.g., the proportion of time an individual was not performing SB [i.e., was performing LPA or MVPA] that the thigh-worn accelerometer correctly classified the activity as not SB). Repeated measures analysis of variance tests were used to compare sensitivity and specificity among accelerometers. Confusion matrices were created to assess misclassification of PA intensity category, and weighted Kappa statistics (with quadratic weights and equal weighting for relative distance) were calculated and compared among accelerometer placements. Additionally, repeated measures analysis of variance was conducted to compare total time spent in SB, LPA, and MVPA predicted from each accelerometer to time measured by DO. For breaks in SB, criterion-measured breaks were also obtained for each participant using DO. Differences among DO and the four accelerometers were evaluated with repeated measures analysis of variance. If significant differences were revealed in any of the repeated measures analysis of variance tests, post hoc dependent t-tests were conducted with a least significant difference correction. An alpha level of p < 0.05 was used to determine statistical significance. 
We desired 90% power to be able to detect significant differences for an effect size of 0.5 among accelerometers for sensitivity and specificity and for predicted and measured time spent in each PA intensity category and breaks in SB. At an alpha level of p = 0.05, 36 subjects were required. Therefore, our sample size of 44 provided sufficient power to address our research questions.
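As an illustration of the sensitivity, specificity, and weighted Kappa calculations described in this section, the sketch below computes one-vs-rest sensitivity and specificity for a single category and a quadratic-weighted Kappa across the three ordered categories. It is a generic implementation of these standard statistics, not the study's analysis code; the function names are hypothetical, and the inputs are assumed to be the per-window DO and predicted categories for one participant.

```python
CATEGORIES = ("SB", "LPA", "MVPA")  # ordered from lowest to highest intensity


def sensitivity_specificity(observed, predicted, category):
    """One-vs-rest sensitivity and specificity for one intensity category.

    Assumes the category occurs at least once and is absent at least once
    in `observed`; otherwise a denominator is zero.
    """
    pairs = list(zip(observed, predicted))
    tp = sum(o == category and p == category for o, p in pairs)
    fn = sum(o == category and p != category for o, p in pairs)
    tn = sum(o != category and p != category for o, p in pairs)
    fp = sum(o != category and p == category for o, p in pairs)
    return tp / (tp + fn), tn / (tn + fp)


def quadratic_weighted_kappa(observed, predicted, categories=CATEGORIES):
    """Cohen's Kappa with quadratic disagreement weights (i - j)^2 / (k - 1)^2."""
    k = len(categories)
    index = {c: i for i, c in enumerate(categories)}
    n = len(observed)
    matrix = [[0] * k for _ in range(k)]          # rows: observed, cols: predicted
    for o, p in zip(observed, predicted):
        matrix[index[o]][index[p]] += 1
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(matrix[i][j] for i in range(k)) for j in range(k)]
    observed_disagreement = 0.0
    expected_disagreement = 0.0
    for i in range(k):
        for j in range(k):
            weight = (i - j) ** 2 / (k - 1) ** 2
            observed_disagreement += weight * matrix[i][j]
            expected_disagreement += weight * row_totals[i] * col_totals[j] / n
    return 1.0 - observed_disagreement / expected_disagreement
```

With quadratic weights, confusing SB with MVPA is penalized four times as heavily as confusing adjacent categories, which matches the interest in whether misclassifications were "by a single intensity category".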

Results
Of the 44 participants in the study, one was excluded from the data analysis because of an accelerometer malfunction during the visit. Additionally, the portable metabolic analyzer (used to address an aim not part of the current study) malfunctioned in three participants, resulting in premature termination of the protocol and exclusion of their data from the analysis. Demographic characteristics of the 40 participants included in data analysis are displayed in Table 2. Approximately 25% of the sample was classified as overweight (body mass index ≥ 25.0 kg/m²), and 90% of the sample reported being right-hand dominant. Sensitivity and specificity for classification of each PA intensity category are displayed in Table 3. For SB, the thigh-worn accelerometer had significantly higher sensitivity and specificity than the rest of the accelerometers, and the left wrist-worn accelerometer showed higher sensitivity and specificity than the hip- and right wrist-worn accelerometers. For LPA, the thigh-worn accelerometer had significantly higher sensitivity and specificity than the hip- and right wrist-worn accelerometers, and the left wrist-worn accelerometer had significantly higher sensitivity and specificity than the hip-worn accelerometer. For MVPA, the thigh-worn accelerometer had significantly higher sensitivity and specificity than all other accelerometers, and the left wrist-worn accelerometer had higher sensitivity and specificity than the hip- and right wrist-worn accelerometers; conversely, the right wrist-worn accelerometer had lower sensitivity and specificity than all other accelerometers. To further examine PA intensity category prediction, confusion matrices were created for PA intensity classification by each accelerometer, as shown in Figure 2.
For the hip-worn accelerometer, most misclassifications were by a single intensity category, with only 6 instances (0.4%) of SB misclassified as MVPA and 44 instances (1.9%) of MVPA misclassified as SB. A similar scenario was present for the thigh-worn accelerometer (1 instance [< 0.1%] of SB misclassified as MVPA and 0 instances of MVPA misclassified as SB) and the left wrist-worn accelerometer ( Table 4. Significant differences in Kappa statistics were seen among all four accelerometers, with almost perfect PA intensity category classification from the thigh-worn accelerometer (K = 0.99), very good classification accuracy from the left wrist-worn (K = 0.95) and hip-worn (K = 0.90) accelerometers, and good classification accuracy of the right wrist-worn accelerometer (K = 0.78) [35]. Predictions of total time spent in SB, LPA, and MVPA among accelerometers and the criterion measure (DO) are shown in Figure 3. SB estimated by the thigh-worn accelerometer was the same as DO-measured SB for 37 of the 40 study participants, resulting in a mean SB predicted for the entire sample that was not significantly different from DO-measured SB. Additionally, predicted time spent in LPA and MVPA by the thigh-worn accelerometer was not significantly different from DO.
Figure 3. Error bars represent standard deviation. * Indicates significant differences from the criterion measure (direct observation).
Breaks in SB as predicted by the accelerometers and measured by DO are shown in Figure 4. Breaks in SB predicted by the thigh-worn accelerometer were the same as DO-measured breaks in SB for each of the 40 participants in the study, resulting in the same mean and standard deviation of the data as DO (i.e., was not significantly different from DO). Conversely, breaks in SB were overestimated by the hip-worn accelerometer (mean difference 1.4 breaks [70%], p < 0.01), left wrist-worn accelerometer (mean difference 0.4 breaks [20%], p < 0.01), and right wrist-worn accelerometer (mean difference 2.7 breaks [135%], p < 0.01).

Figure 4. Predicted vs. measured breaks in SB.
Error bars represent standard deviation. * Indicates significant differences from the criterion measure (direct observation).

Discussion
This study's purpose was to test accelerometers worn on the hip, thigh, and wrists (coupled with ANN models) for prediction of time spent in PA intensity categories (SB, LPA, and MVPA) in 30-second windows as well as breaks in SB. While previous research has shown strong utility of ANNs and other machine learning models for assessment of energy expenditure and/or recognizing specific activity types, to our knowledge, our study is the first to use machine learning models for several accelerometers specifically to assess time spent in PA intensity categories. The assessment of energy expenditure in terms of Calories or METs has proven difficult, with most current research in this area showing prediction errors (i.e., root mean square error) of > 1.0 MET, which represents a large error when the average daily MET level for adults is < 1.5 METs [23,36,37]. While the classification of PA into three distinct intensity categories (SB, LPA, and MVPA) is cruder than predicting Calories or METs, it is simpler to model and still provides valuable information regarding individuals' PA levels, adherence to PA guidelines, and associated health implications.
Our study found high accuracy of the thigh-worn accelerometer for predicting time spent in each PA intensity category, as seen by sensitivities and specificities > 99% for correctly classifying each PA intensity category and no differences from the DO in predictions of total time spent in any category; additionally, the thigh-worn accelerometer correctly predicted the number of breaks in SB for all 40 participants in the study, providing evidence that the thigh-worn accelerometer has high utility for detecting temporal changes in PA and SB. Previous research also shows high accuracy of thigh-worn accelerometers for assessing time spent in SB and breaks in SB [14,15,27], and our study extends these findings by showing that thigh-worn accelerometers are also capable of accurately assessing time spent in LPA and MVPA, thereby demonstrating the thigh-worn accelerometer's utility to assess time spent in SB and health-enhancing PA (i.e., MVPA). The superior accuracy of the thigh-worn accelerometer supports previous research showing high accuracy of a thigh-worn accelerometer for measurement of energy expenditure and activity type classification [23,29,31]. Due to their consistently high accuracy for measuring several different PA constructs and sleep [38], continuous wear (primarily using adhesive tape) for high compliance, and the miniaturization of accelerometer devices, thigh-worn accelerometers have strong potential for assessment of a number of health-related constructs.
Our study also indicated high PA intensity category classification accuracy and prediction of SB for the left wrist-worn accelerometer, achieving sensitivities above 90% and specificities above 95% for all PA intensity categories and no difference from DO for total SB. The left wrist-worn accelerometer did especially well at lower intensity categories, achieving sensitivities and specificities above 97% for SB and LPA. The right wrist-worn accelerometer also had high sensitivities and specificities for SB and LPA (> 93%), but both sensitivity and specificity were significantly lower for MVPA, which was commonly misclassified as LPA. Additionally, the right wrist-worn accelerometer underestimated MVPA but overestimated time spent in SB and LPA and had a more pronounced overestimation of breaks in SB than the left wrist-worn accelerometer. Our findings are supported by those of Esliger et al., who performed the initial validation of the GENEA accelerometer in 2011 and also found higher classification accuracy of PA intensity category using cut-points for a left-wrist accelerometer compared to a right-wrist accelerometer [39]. Given that 90% of our sample reported the left hand being their non-dominant hand, our findings provide evidence that an accelerometer worn on the non-dominant wrist may have better utility for assessing PA intensity category and SB than a dominant wrist-worn accelerometer. One reason for lower accuracy when an accelerometer is worn on the right wrist may be lack of familiarity, since most wristwatches are designed to be worn on the left hand, which could influence movement patterns when performing an activity. Another reason may be the greater variability of movement of the dominant hand during everyday activities. Regardless, these findings suggest that studies assessing PA intensity may benefit from using the non-dominant wrist for accelerometer wear. Pavey et al. and Rowlands et al. provide further rationale for using an accelerometer worn on the non-dominant wrist for assessing time spent in SB [26,40,41]. These findings also support the choice of wearing accelerometers on the non-dominant wrist in the 2011-2014 NHANES data collection cycle [42].
The hip-worn accelerometer, while providing high specificity across all PA intensity categories, had lower sensitivity for correctly classifying PA intensity category and frequently misclassified SB and MVPA as LPA, resulting in underestimations of time spent in SB and MVPA but an overestimation of time spent in LPA and in breaks in SB. Previous studies by Lyden et al. and Kozey-Keadle et al. demonstrate difficulties in assessing time spent in SB and breaks in SB with hip-worn accelerometers using traditional cut-point methods, and our study adds to these findings by illustrating misclassification of SB as a non-sedentary activity (and vice versa) with machine learning models [14,15]. Given the similar hip angle and minimal movement present in most types of SB and for non-sedentary activities such as standing, hip-worn monitors appear limited in their capacity for assessing SB. It should be noted that sensitivities and specificities were above 86% and predicted time in each intensity category was within 15% of the criterion measure with the hip-worn accelerometer, indicating reasonable accuracy for the hip-worn accelerometer. However, with the high performance of the thigh- and left wrist-worn accelerometers for assessing SB as well as higher-intensity PA, these alternative accelerometer locations, when coupled with machine learning modeling, may be preferable to hip-worn accelerometers for assessment of PA intensity.
There are a number of factors to consider when employing the developed ANNs from our study. First, these ANNs were developed in a semi-structured setting and require cross-validation in a free-living environment. Second, most studies use hip-worn accelerometers and cut-points for data analysis, and use of alternate placements and machine learning may affect comparability of PA estimates across studies. Third, we chose to use 30-second windows due to the non-steady-state nature of the protocol. A 30-second window is shorter than the 60-second window typically used for applying cut-points to accelerometer data in studies assessing PA in adults, and studies evaluating window (or epoch) length show that use of different epoch lengths affects estimations of time spent in different PA intensities [43,44]. Therefore, the use of alternate accelerometer placements vs. hip-worn accelerometers, machine learning vs. cut-points, and 30-second vs. 60-second windows should be considered when comparing estimates of PA from the developed ANNs to estimates derived from cut-point approaches with hip-worn accelerometers.
This study had several strengths. First, the semi-structured setting allowed for considerable freedom in the manner in which participants performed their activities, thereby improving the generalizability of the ANN models created. The most common accelerometer placement sites (hip, thigh, and wrist) were included in this study, allowing for direct comparison of their utility for assessment of PA intensity category. Moreover, comparison of left and right wrists provided insight into which wrist may be preferable for accelerometer wear. Finally, classification into PA intensity categories, as was performed in this study, allows for reasonably accurate assessment of PA patterns while avoiding the difficulties associated with predicting energy expenditure or activity type using accelerometers.
This study also had several limitations worth considering. The study sample consisted mainly of younger adults, and the ANN models developed in this study need further validation before use in an older or more diverse population. Additionally, while we consider the semi-structured setting a study strength, the study included a limited set of activities, and the ANN models developed in this study may not correctly classify intensity of other activities, such as sports, which were not evaluated in this study.
DO was used as a criterion measure of PA intensity category, which does not directly measure the energy cost of an activity but, rather, classifies it based on the Compendium of Physical Activities [34]. Our choice for using DO instead of directly-measured energy expenditure (e.g., via a metabolic analyzer) to characterize PA intensity category was predicated on the non-steady-state nature of the study protocol. Periods of SB following higher-intensity PA may have a falsely elevated energy cost (and vice versa for higher-intensity PA following SB) as the body's metabolic processes lag behind the current energy cost of the activity, and this would be problematic for assessing PA intensity category using a metabolic analyzer. To illustrate this point, a previous study by our research group using the same participants and activity protocol found that 29.5% of the time participants spent lying, reading, and using the computer (sedentary activities) elicited an energy expenditure > 1.5 METs and 3.3% of the time elicited an energy expenditure ≥ 3.0 METs (measured with a metabolic analyzer), which would incorrectly characterize the activity intensity as LPA and moderate-intensity PA, respectively. Similarly, 2.6% and 3.2% of the time participants spent jogging and using the stairs (MVPA) elicited an energy expenditure < 3.0 METs, which would incorrectly characterize the activity intensity as LPA [23]. Therefore, we feel that the use of DO was preferable over measured energy expenditure for characterizing PA intensity in the current study.
Another study limitation is that our ANN models were not designed to differentiate between moderate-intensity PA (3.0-5.9 METs) and vigorous-intensity PA (≥ 6.0 METs), instead grouping these into a single MVPA category. While our main goal was to develop models capable of differentiating SB, LPA, and health-enhancing PA (MVPA), further differentiation into moderate and vigorous intensity categories would allow for more specific assessment of meeting PA recommendations. Finally, two different brands of accelerometers were used in the current analysis, making it possible that differences seen between the wrists (GENEA accelerometers) and hip and thigh (ActiGraph accelerometers) were due to accelerometer brand. A study by John et al. found that time domain features (similar to those used in the current study) were not interchangeable between monitor brands, providing evidence that ANNs from the wrist-worn GENEA accelerometer would have lower accuracy if used with wrist-worn ActiGraph accelerometers. However, John et al. also found similar PA measurement accuracy when using ANNs developed specifically for each of the two accelerometer brands [45]. Therefore, it is unlikely that accelerometer brand affected measurement accuracy.

Conclusion
An accelerometer worn on the right thigh, coupled with an ANN model, achieved high accuracy for classification of three distinct PA intensity categories (SB, LPA, and MVPA) as well as breaks in SB in a semi-structured setting. An accelerometer worn on the left wrist also had high accuracy for assessment of SB but had some misclassification of LPA and MVPA, whereas accelerometers worn on the right wrist and hip had the lowest accuracy for assessment of all PA intensity categories and for measuring breaks in SB. These findings support the use of a thigh-worn accelerometer for assessment of time spent in different PA intensity categories. Alternatively, for researchers using wrist-worn accelerometers to assess PA, wear on the non-dominant wrist is likely to allow for higher measurement accuracy than wear on the dominant wrist. Further research should cross-validate these ANN models in a free-living setting to confirm findings from this study.