Autocalibration of accelerometer data for free-living physical activity assessment using local gravity and temperature: an evaluation on four continents

Wearable acceleration sensors are increasingly used for the assessment of free-living physical activity. Acceleration sensor calibration is a potential source of error. This study aims to describe and evaluate an autocalibration method to minimize calibration error using segments within the free-living records (no extra experiments needed). The autocalibration method entailed the extraction of nonmovement periods in the data, for which the measured vector magnitude should ideally be the gravitational acceleration (1 g); this property was used to derive calibration correction factors using an iterative closest-point fitting process. The reduction in calibration error was evaluated in data from four cohorts: UK (n = 921), Kuwait (n = 120), Cameroon (n = 311), and Brazil (n = 200). Our method significantly reduced calibration error in all cohorts (P < 0.01), with reductions ranging from 16.6 to 3.0 mg in the Kuwaiti cohort to 76.7 to 8.0 mg in the Brazilian cohort. Utilizing temperature sensor data resulted in a small nonsignificant additional improvement (P > 0.05). Temperature correction coefficients were highest for the z-axis, e.g., 19.6-mg offset per 5°C. Further, application of the autocalibration method had a significant impact on typical metrics used for describing human physical activity, e.g., in Brazil average wrist acceleration was 0.2 to 51% lower than uncalibrated values depending on metric selection (P < 0.01). The autocalibration method as presented helps reduce the calibration error in wearable acceleration sensor data and improves comparability of physical activity measures across study locations. Temperature utilization seems essential when temperature deviates substantially from the average temperature in the record but not for multiday summary measures.

Wearable accelerometers are increasingly used in the assessment of physical activity (2,4,6). In recent years accelerometers have become available that are feasible for long-term monitoring of behavior in population studies, while at the same time being capable of storing weeklong data in g-units (1 standard g = 9.80665 m/s²) at a sample frequency high enough to capture the main frequencies of body movement, referred to as raw data accelerometry (12). Population studies collecting raw accelerometer data include surveillance studies like NHANES (26) in the U.S. and national biobanks such as UK Biobank (27).
An acceleration sensor works on the principle that acceleration is captured mechanically and converted into an electrical signal, which depending on the sensor type is either a voltage, a resistance, or a capacitance (13). The relationship between the electrical signal and the acceleration is usually assumed to be linear, involving an offset and a gain factor. We shall refer to the establishment of the offset and gain factor as the sensor calibration procedure (5,18). Accelerometers are usually calibrated as part of the manufacturing process under nonmovement conditions using the local gravitational acceleration as a reference (5,18). The manufacturer calibration can later be evaluated by holding each sensor axis parallel (up and down) or perpendicular to the direction of gravity; readings for each axis should be ±1 and 0 g, respectively (5,18).
However, this procedure can be cumbersome in studies with a high throughput. Furthermore, such a calibration check will not be possible for data that have been collected in the past and for which the corresponding accelerometer device does not exist anymore. Techniques have been proposed that can check and correct for calibration error based on the collected triaxial accelerometer data in the participant's daily life without additional experiments, referred to as autocalibration (6a, 8-10, 19). The general principle of these techniques is that a recording of acceleration is screened for nonmovement periods. Next, the moving average over the nonmovement periods is taken from each of the three orthogonal sensor axes and used to generate a three-dimensional ellipsoid representation that should ideally be a sphere with radius 1 g; see the example in Fig. 1. Here, deviations between the radius of the three-dimensional ellipsoid and 1 g (ideal calibration) can then be used to derive correction factors for sensor axis-specific calibration error (6a, 8-10, 19).
Previously published work on autocalibration techniques focused on the technical description and proof of concept but did not demonstrate feasibility and accuracy in wrist accelerometer data collected under real study conditions, involving participants under free-living conditions (daily life) and in a diverse sample of the global population (6a, 8-10, 19). Furthermore, it remains unknown whether autocalibration has a significant impact on acceleration metrics typically used for physical activity assessment.
Temperature has been identified as a potential source of calibration error in low-cost acceleration sensors (20). The specification sheet of the acceleration sensor chip used in the GENEActiv accelerometer (ADXL345; Analog Devices), as used in this study, indicates that a change of 1°C relative to 25°C could result in a 0.4 to 1.2 mg (1 g = 1,000 mg) change in acceleration value (1). It could therefore be hypothesized that the availability of temperature information alongside measurement of acceleration may aid the autocalibration process.
The current study aims to describe an autocalibration method that can be configured to take into account a potential temperature dependency of the sensor's response to acceleration. The second aim is to implement and evaluate the autocalibration method in a diverse sample of the global population. The third and final aim is to demonstrate the degree to which application of the autocalibration method has any significant impact on metrics derived for physical activity assessment.

METHODS
Population data. The autocalibration method was evaluated based on data collected with wrist-worn raw accelerometry in subsamples of epidemiological cohorts from Africa, Europe, South America, and the Middle East, representing locations with different gravity. Cohorts included the Fenland Study (Cambridgeshire, UK) (21), a repeated cross-sectional survey of the Cameroon Physical Activity Study (3), the Kuwait Wellbeing Study, and the 1993 Pelotas birth cohort (Brazil) (28). The data subsamples in each cohort span most local seasons and represent very diverse populations, lifestyles, and environmental conditions. Basic cohort characteristics are described in Table 1.
The same accelerometer brand was used in all cohorts (GENEActiv; Activinsights, Kimbolton, UK). This accelerometer includes a triaxial acceleration sensor (ADXL345) with a ±8-g dynamic range and a 12-bit resolution and a temperature sensor (MCP9700T). Most of the devices used in the UK and Brazil cohorts were older (lower serial number) than the devices used in the Kuwait and Cameroon cohorts. In all cohorts, participants were asked to wear the accelerometer on their nondominant wrist during sleeping and waking hours. All participants provided informed consent, and each study was approved by the local ethics committee.
Autocalibration method. Two versions of the autocalibration method were designed and evaluated: one based on acceleration data only (C1) and one based on both acceleration and temperature (C2). For every consecutive time window of 10 s in a particular data record, the following signal features were extracted: average acceleration per axis, standard deviation of the acceleration per axis, and average temperature. For the calibration procedure, only time windows for which the standard deviation was <13 mg in all three axes were retained. Here, 13 mg was selected just above the empirically derived baseline (noise) standard deviation of 10 mg to retain only nonmovement periods. The resulting set of time windows, or calibration epochs, for each of the three axes can be presented in a three-dimensional space as an ellipsoid (22), an example of which is shown in Fig. 1. The deviation between 1 g and the Euclidean norm ($\sqrt{a_x^2 + a_y^2 + a_z^2}$) of the acceleration of the three axes is an indication of calibration error. Next, the axis-specific calibration for C1 can be defined as:
$$s_i'(t) = d_i + a_i \, s_i(t)$$
Here, si(t) and si'(t) correspond to the acceleration signal before and after correction, respectively, i is the sensor axis (x, y, or z), t is the time point, di is the offset, and ai is the gain factor. Six parameters are optimized for this model C1, while minimizing the average calibration error, defined as the absolute difference between 1 g and the vector magnitude (Euclidean norm) averaged across all calibration epochs. If temperature is taken into account (C2) the formula is:
$$s_i'(t) = d_i + a_i \, s_i(t) + m_i \, \bigl(T(t) - c\bigr)$$
Here, T(t) is the temperature at time point t, c is the average temperature in the ellipsoidal data as used for the autocalibration procedure, and mi is the axis-specific temperature-related offset correction factor. The average temperature acts as a fixed reference point relative to which di, ai, and mi (9 parameters) are optimized.
Mathematically, constant c could have been merged with di to shorten the equation but we have kept them separate to allow direct comparison of di parameters between the two autocalibration models.
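The epoch extraction and the calibration error described above can be sketched in a few lines of Python. This is an illustrative sketch and not the GGIR implementation: the function names are hypothetical, the use of the sample standard deviation is an assumption (the estimator is not specified in the text), and the correction is assumed to have the form offset + gain × signal plus a temperature term proportional to the deviation from the reference temperature.

```python
import math
from statistics import mean, stdev

SD_LIMIT_G = 0.013  # 13 mg nonmovement threshold taken from the text


def calibration_epochs(acc, temp, window_len):
    """Return (mean xyz in g, mean temperature) for each nonmovement window.

    acc: list of (x, y, z) samples in g; temp: one temperature per sample;
    window_len: samples per 10-s window (sample rate x 10, at least 2).
    """
    epochs = []
    for start in range(0, len(acc) - window_len + 1, window_len):
        axes = list(zip(*acc[start:start + window_len]))
        # Keep the window only if all three axes are below the threshold.
        if all(stdev(axis) < SD_LIMIT_G for axis in axes):
            xyz = tuple(mean(axis) for axis in axes)
            epochs.append((xyz, mean(temp[start:start + window_len])))
    return epochs


def correct(xyz, temperature, offset, gain,
            temp_coef=(0.0, 0.0, 0.0), temp_ref=0.0):
    """Assumed model: offset + gain * signal + temp_coef * (T - temp_ref)."""
    return tuple(d + a * s + m * (temperature - temp_ref)
                 for s, d, a, m in zip(xyz, offset, gain, temp_coef))


def calibration_error(epochs, offset, gain,
                      temp_coef=(0.0, 0.0, 0.0), temp_ref=0.0):
    """Mean absolute difference between the Euclidean norm and 1 g."""
    errors = []
    for xyz, temperature in epochs:
        corrected = correct(xyz, temperature, offset, gain, temp_coef, temp_ref)
        errors.append(abs(math.sqrt(sum(v * v for v in corrected)) - 1.0))
    return mean(errors)
```

With all temperature coefficients left at zero the functions describe configuration C1; passing nonzero coefficients and the mean epoch temperature as `temp_ref` corresponds to C2.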
An iterative closest point (ICP) fitting process of the moving average values to a sphere (C1) or a hypercylinder (C2) was used to optimize the six (C1) or nine (C2) calibration correction factors, respectively. Here, the hypercylinder corresponds to the 1-g sphere augmented with a linear temperature offset adjustment for C2. The procedure was followed by downweighting outliers to minimize the impact of nonstationary data not being excluded based on the 13-mg threshold as mentioned earlier. Here, the weighting was calculated as 1 g divided by the absolute difference between the Euclidean norm corresponding to a 10-s calibration window (1 point on the ellipsoid) and 1 g, with 100 being the maximum weighting. Consequently, all data points with <10-mg calibration error had a weighting of 100 (1 g/0.01 g = 100). The weighting was updated at every stage of the iterative process. The ICP starting points were chosen based on the assumption that the optimal calibration factors are the set representing the local minimum error nearest to perfect factory calibration, with di = 0, ai = 1, and mi = 0. The ICP was limited to a maximum of 1,000 iterations and terminated sooner if the iterative change in error was <1e-10 g.
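To make the fitting loop concrete, the following Python sketch fits offset and gain for configuration C1 only (no temperature term). It is a simplified illustration under stated assumptions, not the code released in GGIR: the function names are hypothetical, the correction is assumed to be offset + gain × signal, each axis is updated with a weighted straight-line fit of the closest point on the unit sphere against the current corrected value, and the synthetic data at the end are invented purely to exercise the function.

```python
import math


def norm(v):
    return math.sqrt(sum(x * x for x in v))


def weighted_line_fit(x, y, w):
    """Weighted least-squares fit of y ~ intercept + slope * x."""
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    slope = sxy / sxx
    return my - slope * mx, slope


def apply_factors(points, offset, gain):
    return [[d + a * s for s, d, a in zip(p, offset, gain)] for p in points]


def mean_abs_error(points, offset, gain):
    norms = [norm(c) for c in apply_factors(points, offset, gain)]
    return sum(abs(n - 1.0) for n in norms) / len(norms)


def fit_sphere(points, max_iter=1000, tol=1e-10):
    """Iteratively pull epoch means towards the unit (1 g) sphere."""
    offset, gain = [0.0] * 3, [1.0] * 3  # start at perfect factory calibration
    err = prev_err = mean_abs_error(points, offset, gain)
    for _ in range(max_iter):
        curr = apply_factors(points, offset, gain)
        norms = [norm(c) for c in curr]
        # Weight = 1 g / |norm - 1 g|, capped at 100 (errors below 10 mg).
        weights = [100.0 if abs(n - 1.0) < 0.01 else 1.0 / abs(n - 1.0)
                   for n in norms]
        closest = [[v / n for v in c] for c, n in zip(curr, norms)]
        for k in range(3):
            intercept, slope = weighted_line_fit(
                [c[k] for c in curr], [q[k] for q in closest], weights)
            # Compose the per-axis update with the factors found so far.
            offset[k] = intercept + slope * offset[k]
            gain[k] = slope * gain[k]
        err = mean_abs_error(points, offset, gain)
        if abs(prev_err - err) < tol:
            break
        prev_err = err
    return offset, gain, err


# Synthetic check: unit vectors spread over a sphere, distorted by known factors.
true_offset, true_gain = (0.02, -0.03, 0.05), (1.01, 0.98, 1.03)
n_points = 200
measured = []
for i in range(n_points):
    z = 1.0 - 2.0 * (i + 0.5) / n_points
    r = math.sqrt(1.0 - z * z)
    phi = i * math.pi * (3.0 - math.sqrt(5.0))  # golden-angle spiral
    unit = (r * math.cos(phi), r * math.sin(phi), z)
    # Invert the assumed correction so that offset + gain * measured = unit.
    measured.append([(u - d) / a for u, d, a in zip(unit, true_offset, true_gain)])

offset, gain, error = fit_sphere(measured)
```

Extending the sketch to C2 would mean adding the temperature deviation as a second regressor in each per-axis fit, which turns the straight-line fit into a two-variable weighted regression.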
To ensure a meaningful and robust autocalibration, it was only executed when the calibration ellipsoid was sufficiently sparsely populated with data points (calibration epochs). For this evaluation, we used a sparseness criterion of at least one ellipsoid value higher than 300 mg and at least one value lower than −300 mg for each of the three sensor axes. Only measurements lasting at least 24 h were considered, as shorter measurements are commonly excluded when assessing habitual physical activity during free-living conditions.
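The sparseness criterion amounts to a per-axis range check on the calibration epochs. A minimal Python sketch, with a hypothetical function name and epoch means expressed in g:

```python
def is_sparse_enough(epoch_means, limit_g=0.3):
    """True if every axis has an epoch mean above +limit_g and one below -limit_g.

    epoch_means: list of (x, y, z) epoch averages in g; limit_g = 0.3 is the
    300-mg criterion from the text.
    """
    if not epoch_means:
        return False
    return all(max(axis) > limit_g and min(axis) < -limit_g
               for axis in zip(*epoch_means))
```

For example, a record in which the device never pointed z-axis down would have no z value below −300 mg and would fail the check, whatever the coverage on the other two axes.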
To minimize signal processing time, the autocalibration method initially only uses the first 72 h (3 days) of a measurement file, based on which calibration error reduction is evaluated. If the file length is <3 days, then all available data are used. If calibration error is not reduced to <10 mg or if the ±300-mg criterion for ellipsoid data sparseness is not met, additional chunks of 12-h data are iteratively added until either both the error and sparseness criteria are met or the end of the file is reached. The criterion of 10 mg was considered close to the resolution of the data (3.9 mg) and a realistic target based on pilot tests. Calibration error below the sensor resolution is theoretically possible, but these calibration errors may not be distinguishable from the impact of data resolution boundaries. Therefore, a calibration error reduction to <10 mg was considered acceptable. If the calibration error after autocalibration was higher than before autocalibration, then correction factors were replaced by default values 1 and 0 for gain and offset, respectively. The latter was done to avoid a negative influence of autocalibration on the data.
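The data-selection logic in this paragraph can be sketched as a small control loop. The Python below is an illustration under assumptions: `evaluate` is a hypothetical callback that runs the autocalibration on the first `hours` of the file and returns the resulting calibration error (in g) together with the outcome of the sparseness check, and both function names are invented for this example.

```python
def select_calibration_data(file_hours, evaluate, start_hours=72,
                            step_hours=12, target_error_g=0.010):
    """Grow the analysed span until the error and sparseness criteria are met.

    Returns (hours used, calibration error in g, sparseness flag).
    """
    hours = min(start_hours, file_hours)  # short files: use all data
    while True:
        error_g, sparse_ok = evaluate(hours)
        criteria_met = error_g < target_error_g and sparse_ok
        if criteria_met or hours >= file_hours:
            return hours, error_g, sparse_ok
        hours = min(hours + step_hours, file_hours)  # add a 12-h chunk


def accept_or_reset(error_before_g, error_after_g, offset, gain):
    """Fall back to default factors if autocalibration made the error worse."""
    if error_after_g > error_before_g:
        return (0.0, 0.0, 0.0), (1.0, 1.0, 1.0)
    return offset, gain
```

Keeping the fit itself behind a callback separates the stopping rule from the fitting procedure, so the same loop applies to both C1 and C2.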
The method has been released as function g.calibrate in R-package GGIR, which currently works with binary data collected with the accelerometer used in the current study as well as its predecessor, GENEA (14). Additionally, an extract of the R-code related to the ICP fitting process is provided in the APPENDIX.
Evaluation. The absolute difference between 1 g and the Euclidean norm of the values of the three axes was averaged per measurement file (1 file = 1 participant) and used as an indicator of calibration error before autocalibration (C0), following autocalibration without temperature compensation (C1), and following autocalibration with temperature compensation (C2).
Further, we assessed the impact of autocalibration on population estimates of physical activity using two commonly used metrics of body movement: the Euclidean Norm Minus One with negative values rounded up to zero (ENMO) and band-pass filtering of the three axes followed by the Euclidean Norm of the resulting signals (BFEN), as previously described (11,13,25). BFEN was applied with a fourth-order band-pass Butterworth filter with cut-off frequencies of 0.5 and 15 Hz. Metric ENMO is similar in design to a metric used by colleagues, named SVMgs (7,23,29). See Hildebrand et al. (15) for a discussion of the subtle differences between SVMgs and ENMO.
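As a concrete reading of the ENMO definition, the following Python sketch computes the metric per sample and its average. The function names are hypothetical, input values are assumed to be in g, and BFEN is not shown because it additionally requires a band-pass filter implementation.

```python
import math


def enmo(sample):
    """Euclidean Norm Minus One (in g), with negative values set to zero."""
    return max(0.0, math.sqrt(sum(v * v for v in sample)) - 1.0)


def mean_enmo(samples):
    """Average ENMO over a sequence of (x, y, z) samples in g."""
    return sum(enmo(s) for s in samples) / len(samples)
```

Because 1 g is subtracted from the raw norm, an uncorrected offset carries straight into the result: a motionless sensor that reads (0, 0, 1.02) g yields an ENMO of about 20 mg rather than zero.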
Here, we looked at the average metric output and its distribution over each participant's measurement record based on 5-s epoch averages. The impact on the distribution was quantified as changes to the 5th, 25th, 50th, 75th, 95th, and 97.92nd percentiles. The latter percentile (97.92) corresponds to the 30 most active minutes in a day. All participant-level values were summarized as mean (SD) across each cohort. Data cleaning stages, including nonwear detection, were applied as reported previously (11,14,24). Detected monitor wear duration was used to evaluate whether monitor wear duration plays a role in the success of the autocalibration procedure.
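The link between the 97.92nd percentile and the 30 most active minutes follows from simple arithmetic, sketched below in Python. The sketch assumes a full 24-h day of valid data, and the function names are hypothetical.

```python
MINUTES_PER_DAY = 24 * 60


def percentile_for_top_minutes(minutes):
    """Percentile above which the given number of minutes per day falls."""
    return 100.0 * (1.0 - minutes / MINUTES_PER_DAY)


def epochs_in_minutes(minutes, epoch_seconds=5):
    """Number of epochs of the given length that make up those minutes."""
    return minutes * 60 // epoch_seconds
```

For 30 min the first function gives 97.9167, which rounds to the 97.92 reported in the text; with 5-s epochs this corresponds to the 360 highest epoch values in a day.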
Finally, to estimate the relative importance of correcting offset or gain factors we selected a random sample of 20 accelerometer recordings from the Pelotas cohort and investigated how autocalibration performance is affected when optimizing only offset or gain, with the corresponding other set of factors fixed to 1 (gain) or 0 g (offset), respectively.
Statistics. All statistical analyses were conducted in R (http://cran.r-project.org/). Wilks' lambda test was used to compare the three autocalibration configurations across all percentiles. If Wilks' lambda test indicated a significant difference, then repeated measures ANOVA was used to compare the three autocalibration configurations per metric, using the function lme from the nlme package and the function anova from the stats package (20a). Post hoc pair-wise Tukey tests were performed using the function glht from the multcomp package (16). Significance was set at P < 0.05.

RESULTS
Application of the autocalibration method significantly reduced calibration error in all cohorts (P < 0.01), with improvements being greatest in the Brazilian cohort (from 76.7 to 8.0 mg) and smallest in the Kuwaiti cohort (from 16.6 to 3.0 mg; see Table 3). However, no significant further reduction in calibration error was observed in any of the four cohorts for autocalibration with temperature utilization (C2) compared with that without temperature utilization (C1) (P > 0.05; see Table 3). The percentage of files with calibration error under 10 mg was 6.1, 94.4, and 99.0% for C0, C1, and C2, respectively (Pearson's Chi-squared: χ² = 3,816.0, df = 2, P < 0.0001). An animation of the calibration ellipsoid before and after calibration can be found on our website: http://www.mrc-epid.cam.ac.uk/research/resources. Application of the autocalibration method had a significant impact on the average and distribution of acceleration metric output in each of the four cohorts (F > 5.8, P < 0.001). The magnitude of the difference between C0 and C1 for metric BFEN was systematically <1 mg, in contrast to metric ENMO, for which differences of 20 mg and higher were observed between C0 and C1 (see Tables 4 and 5).
Post hoc Tukey analyses revealed no significant difference in metric output between C1 and C2, except for the lower range in the distribution of acceleration values in the UK cohort; see Tables 4 and 5.

Average calibration correction factors are reported in Table 2.
The minimum within-person temperature range observed within the ellipsoid data was 8.8, 9.7, 6.4, and 7.1°C for UK, Kuwait, Cameroon, and Brazil, respectively.
For the evaluation of the relative importance of offset and gain (Pelotas subset), autocalibration based on only offset correction or only gain correction reduced a 65.0 ± 26.7 mg calibration error (C0) to 12.4 ± 12.7 and 51.8 ± 31.2 mg, respectively. Additional utilization of temperature reduced these calibration errors to 10.2 ± 13.3 and 45.0 ± 23.9 mg, respectively, while optimizing both offset and gain resulted in calibration errors of 4.6 ± 1.3 and 7.7 ± 2.7 mg for C2 and C1, respectively.
Table note (metric ENMO): Data are presented as sample mean (SD) and percentiles based on 5-s epoch averages; Pk = kth percentile. ENMO (in mg), the Euclidean Norm Minus One; C0, no autocalibration; C1, autocalibration without temperature; C2, autocalibration with temperature. *P value for ANOVA and Wilks' lambda. P values for the Tukey test are indicated with symbols that distinguish three cases: significant pair-wise differences for C0-C1 and C0-C2; for C0-C1 and C1-C2; and for C0-C1, C0-C2, and C1-C2.

DISCUSSION
The autocalibration method as presented allows for a significant reduction in average calibration error under a wide range of study conditions. Temperature utilization did not result in a significant further reduction of average calibration error for the measures selected. However, inspection of the derived temperature offset correction factors (Table 2) indicates that temperature utilization could be essential for sections of the signal with temperature conditions far away from the average temperature. For example, in the UK cohort the average temperature offset correction factor for the z-axis was 0.00392 (Table 2), which given a temperature difference of 5°C would result in a change of acceleration of 0.0196 g (5 × 0.00392 g). A value of 19.6 mg may be considered high in the context of the acceleration value distribution as provided in Tables 4 and 5. The significant difference found between C1 and C2 in the lower end of the metric value distribution in the UK cohort hints at an impact of temperature utilization that will only be visible in the most inactive parts of a day. Considering that sleep is likely to take up >25% of a day, it seems unlikely that the temperature dependency of the 5th and 25th percentiles relates to waketime behavior. Instead, accounting for temperature dependency may help to improve estimates of monitor nonwear time and the detection of sleep stages in future research.
The implementation of the autocalibration method had a significant impact on the average and distribution of metric outputs, although substantially more so for metric ENMO than for metric BFEN. In our previous study we observed that metrics ENMO and BFEN are highly correlated but not identical (11). Metric ENMO may be more appropriate for energy expenditure estimation and easier for researchers to describe, replicate, and interpret (11). In addition, the frequency filtering as part of metric BFEN effectively reduces calibration offset error, which explains why the autocalibration procedure as evaluated here shows only minor impact on these estimates. Temperature changes tend to be slow, so the band-pass filter would catch and remove them as low-frequency components (11). We conclude from this that autocalibration will have an important impact on studies that rely on average and distribution characteristics of metric ENMO but much less so for metric BFEN. Note that these findings should not be confused with the validity of metric BFEN or ENMO.
The strong relative importance of offset correction as seen in the subsample of 20 individuals combined with the fairly constant absolute difference between the cohort percentiles corresponding to C0 and to C1 (Tables 4 and 5) indicates that the offset calibration has a bigger impact compared with gain calibration. Translating this observation to physical activity research means that the impact of calibration error and therefore the benefit of autocalibration will be relatively high for physical activities involving low acceleration and relatively low for activities involving high magnitude accelerations.
Results indicate that the autocalibration method works under a wide range of experimental conditions, spanning different geographical latitudes, different seasons affecting temperature variation during the day, different populations affecting movement and activity patterns, different built environments, and different adult age groups. Nonetheless, the dataset as presented is insufficient to investigate the causal relationship between specific study conditions and calibration error. The difference in the precision gain of autocalibration between UK and Brazil on the one hand and Cameroon and Kuwait on the other hand may indicate that the relatively newer devices used for Cameroon and Kuwait have less calibration error. Again, a lack of standardized conditions complicates this comparison. It is also important to note that the proposed method effectively expresses all data relative to local gravity, which has known geographical variation; one would need to multiply by the magnitude of local gravity to convert to absolute acceleration in meters per second squared. Despite the challenges in directly comparing the four cohorts, the results stratified by cohort illustrate that the method succeeds in reducing error in each of the four study settings, with an impact on typical physical activity summary measures proportional to baseline calibration error (C0).
Table note (metric BFEN): Data are presented as sample mean (SD) and percentiles based on 5-s epoch averages; Pk = kth percentile. BFEN (in mg), band-pass filtering of the three axes followed by the Euclidean Norm of the resulting signals; C0, no autocalibration; C1, autocalibration without temperature; C2, autocalibration with temperature. *P value for ANOVA and Wilks' lambda. P values for the Tukey test are indicated with symbols that distinguish three cases: significant pair-wise differences for C0-C1 and C0-C2; for C0-C1 and C1-C2; and for C0-C1, C0-C2, and C1-C2. ns, P value for ANOVA >0.05.
The current study was done with wrist-worn accelerometers. Compared with other body locations, wrist attachment may allow for easier collection of sparse ellipsoidal data, thereby enhancing the autocalibration process. Therefore, caution is needed when implementing this method on data collected from other body locations.
In conclusion, the autocalibration method as presented reduces the calibration error in acceleration data from wrist-worn sensors as collected on four continents. Temperature utilization seems essential for those sections of the signal where temperature deviates substantially from the average temperature, but less so for overall summary measures related to the average and distribution of the magnitude of acceleration over several days.

APPENDIX I: EXTRACT OF R-CODE RELATED TO ICP PROCEDURE FROM R-PACKAGE GGIR
The variable "input" is the average acceleration per axis per epoch provided as a matrix with three columns corresponding to the three axes. Variable "inputtemp" is the average temperature per epoch provided as a matrix with the temperature values replicated in three columns.

ACKNOWLEDGMENTS
The sponsors of the study had no role in study design, data collection, data analysis, data interpretation, or writing of the report. The corresponding author had full access to all the data in the study and had final responsibility for the decision to submit for publication.

GRANTS
This research was supported by funding from the Wellcome Trust, the Medical Research Council (MC_UU_12015/3), and His Highness Shiekh Nasser Al-Mohammad Al-Sabah, and Dasman Diabetes Institute.

DISCLOSURES
Joss Langford is employed by Activinsights. We declare that we have no competing interests.