Supervised versus unsupervised approaches to classification of accelerometry data

Abstract Sophisticated animal-borne sensor systems are increasingly providing novel insight into how animals behave and move. Despite their widespread use in ecology, the diversity and expanding quality and quantity of the data they produce have created a need for robust analytical methods for biological interpretation. Machine learning tools are often used to meet this need. However, their relative effectiveness is not well known and, in the case of unsupervised tools, which do not use validation data, accuracy can be difficult to assess. We evaluated the effectiveness of supervised (n = 6), semi-supervised (n = 1), and unsupervised (n = 2) approaches to analyzing accelerometry data collected from critically endangered California condors (Gymnogyps californianus). Unsupervised K-means and EM (expectation-maximization) clustering approaches performed poorly, with overall classification accuracies of <0.8 and very low kappa statistics (range: −0.02 to 0.06). The semi-supervised nearest mean classifier was moderately effective, with an overall classification accuracy of 0.61, but it classified effectively only two of the four behavioral classes. Supervised random forest (RF) and k-nearest neighbor (kNN) machine learning models were most effective at classification across all behavior types, with overall accuracies >0.81. Kappa statistics were also highest for RF and kNN, in most cases substantially greater than for other modeling approaches. Unsupervised modeling, which is commonly used for the classification of a priori-defined behaviors in telemetry data, can provide useful information but is likely better suited to post hoc definition of generalized behavioral states. This work also shows the potential for substantial variation in classification accuracy among different machine learning approaches and among different metrics of accuracy.
As such, when analyzing biotelemetry data, best practices appear to call for the evaluation of several machine learning techniques and several measures of accuracy for each dataset under consideration.


| INTRODUCTION
Understanding how animals behave and move is important to improve wildlife monitoring and management, especially for species that face increasing risks in rapidly changing landscapes (Kays et al., 2015). Sophisticated animal-borne sensor systems, accelerometers in particular, now offer the possibility of continuously monitoring the activities and movement of free-ranging animals without the logistical difficulties of direct observation and access (Fischer et al., 2018; Nathan et al., 2012; Shamoun-Baranes et al., 2012; Wilson et al., 2006; Yoda et al., 2001). These tools have been applied in studies of many species and habitat types to answer questions on foraging and hunting (Hernández-Pliego et al., 2017; Sato et al., 2015; Williams et al., 2015), energy expenditure (Collins et al., 2015; Elliott et al., 2014; Gómez Laich et al., 2011; Wilson et al., 2020), movement behavior (Ishii et al., 2017; Williams et al., 2020; Yoda et al., 2001), and migration strategy (Bishop et al., 2015; Weimerskirch et al., 2016). The diverse and expanding quality and quantity of accelerometer-derived data have created a need for robust analytical methods to interpret patterns in data.
A suite of methods has been developed to extract information about animal behavior from accelerometer data, many using techniques that have been broadly termed "machine learning". The term "machine learning" (ML) has been defined as programming that allows a computer to learn from experience, where learning can be measured and scored iteratively (Samuel, 1959). Typically, machine learning involves a series of steps, including data collection, feature selection, problem definition, algorithm and parameter selection, and model training and evaluation (Figure 1). For accelerometry, the simplest machine learning classification problems use validation data to inform the classification of tri-axial measurements to behavior types (e.g., acceleration pattern x means behavior A and acceleration pattern y means behavior B; Studd et al., 2019). Without validation data, the ML classification problem can be approached from an unsupervised clustering perspective (Sakamoto et al., 2009).
Alternatively, when only a small quantity of annotated acceleration data is available, it is possible to use a semi-supervised ML method that blends elements of clustering and supervised learning (Tanha et al., 2012). Finally, machine learning research questions may focus on anomaly detection. In the case of accelerometry, this is less common, although anomalous readings are sometimes investigated to understand if they are driven by biology or technology (Tobin et al., 2020).
Unsupervised ML techniques commonly are used to interpret behavior from accelerometer data (Chimienti et al., 2016). Although fairly easy to implement, the accuracy of these methods is hard to interpret given that unsupervised methods do not require validation data. It is therefore surprising that there is little guidance as to the utility and appropriateness of unsupervised methods for these analyses. To address this problem, we evaluated the effectiveness of supervised (n = 6 modeling approaches) and semi-supervised (n = 1) ML methods relative to that of two unsupervised methods to analyze accelerometry data. The data we used were collected from a critically endangered species, the California condor (Gymnogyps californianus), that is the focus of extensive monitoring and management. Our analysis provides insight into the relative value of different approaches to the analysis of accelerometry data. We also use the information we generate to create a set of recommended best practices for interpreting accelerometry data.

| Study area and model species
The California condor is the largest soaring bird in North America.
Under the framework of a "Condor Recovery Program," condors are now captive-bred and released to sustain a wild population spread across the southwestern United States and northwestern Mexico (USFWS, 2013). From approximately August to November of each year, captive-bred juvenile condors are released at Bitter Creek National Wildlife Refuge, located in the foothills of the southwestern San Joaquin Valley in Kern County, California, USA (USFWS, 2013). Prior to release, the condors spend time in a captive enclosure (hereafter, a "flight pen") with minimal exposure to humans (USFWS, 2013). This allows them to become familiar with the release site and to engage in typical condor behaviors while interacting with wild condors that are perched or feeding nearby but outside the pen.

FIGURE 1 Representation of the components of a generic machine learning model. The steps shown in the figure were used to identify behavioral modes or states from accelerometry data in this study and are described in detail in Section 2.

| Accelerometry data collection
We outfitted nine condors in the flight pen with patagial tags, each with a unique ID, and a proprietary solar-powered Global Positioning System-Global System for Mobile Communications (GPS-GSM) telemetry device weighing 50 g (Cellular Tracking Technologies, LLC).
In addition to GPS data (which were not used in this study), the units collected tri-axial acceleration data at a rate of 20 Hz. Patagial tags were generally attached to the right wing, but not always, so there is a chance that differences among individuals could stem from tags being on different wings. Data were transmitted once daily over cellular networks and then downloaded to a server.
For additional details on wing tagging and telemetry of condors, see Poessel et al. (2018).

| Video data collection
We used digital cameras located inside the flight pen to record continuous video of condors and condor behavior. We recorded video continuously during daylight hours with two cameras (one AXIS P3367-VE, Axis Communications AB and one AV3115, AV Costar) located at opposite ends of the flight pen. The cameras were mounted to the walls of the flight pen at ~2 m above ground in a configuration that together allowed observation of individual birds and identification of codes on patagial tags located at any spot within the flight pen. The digital data recorded by the cameras were backed up nightly to an external hard drive. We used the Milestone XProtect Essential+ application for video management and instantaneous viewing (Milestone Systems A/S).

| Segmentation and identification of behaviors
Prior to classification, continuous accelerometry data usually are divided (hereafter, "segmented") into either variable- or fixed-time segments that use inherent characteristics within the data to identify boundaries (change points) between different behavioral states (Sur et al., 2017). We used variable-time segments because this approach improves classification accuracy by better grouping together similar behaviors (Bom et al., 2014). We used a nonparametric model framework implemented with the function processStream in the package "cpm" (Ross, 2015) in R (v4.0.2; R Core Team, 2021) to identify change points (Bom et al., 2014; Shamoun-Baranes et al., 2012; Sur et al., 2017). Because the mean and variance responded most strongly to accelerometer values on the X axis, these values were chosen as input to the model (e.g., Bom et al., 2014).
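We did not reimplement the "cpm" tests here, but the general idea of change-point segmentation can be sketched in Python. Everything in this sketch (the function name `detect_change_points`, the window size, the threshold, and the synthetic signal) is our own illustrative choice, and this simple mean-shift test is far cruder than the nonparametric CPM framework used in the study:

```python
import numpy as np

def detect_change_points(x, window=20, threshold=3.0):
    """Flag indices where the mean of a 1-D signal shifts abruptly.

    Compares the means of adjacent windows; a change point is declared
    where their difference exceeds `threshold` pooled standard errors.
    This is a toy stand-in for the tests in the R "cpm" package, not a
    reimplementation of them.
    """
    change_points = []
    i = window
    while i + window <= len(x):
        left, right = x[i - window:i], x[i:i + window]
        pooled_se = np.sqrt(left.var() / window + right.var() / window) + 1e-12
        if abs(left.mean() - right.mean()) > threshold * pooled_se:
            change_points.append(i)
            i += window  # restart the scan after a detected boundary
        else:
            i += 1
    return change_points

# Synthetic X-axis accelerometry: quiet sitting, then a burst of activity
rng = np.random.default_rng(0)
signal = np.concatenate([rng.normal(0.0, 0.1, 200),   # low mean, low variance
                         rng.normal(1.5, 0.4, 200)])  # abrupt shift at index 200
cps = detect_change_points(signal)
```

The spans between consecutive detected change points would then form the variable-time segments passed to feature extraction.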
Once acceleration data were segmented, a single observer (MS) reviewed videos to assign behaviors from a pre-defined ethogram to each of the variable-time segments identified by the change point model. This pre-defined ethogram was derived based on behaviors observed in the field and in preliminary viewing of the digital video we recorded. Our ethogram included eight types of behavior: sitting, walking, drinking, feeding, sunbathing, social interaction, preening, and flying (Table 1). Since the flight pens had snags on which condors perched, for those behaviors that could occur either on the ground or on a perch (sitting, walking, sunbathing, social interaction, and preening), we differentiated between events occurring on the ground versus those on a perch. If a segment included two behaviors, we categorized that segment with the behavior occupying the majority of time in the segment. Finally, we realized that some of these behaviors would be difficult to distinguish based solely on accelerometer data and that it would be difficult to identify behaviors on the ground versus on a perch. As such, we evaluated the initial performance of ML models after combining similar behaviors into a single category and ignoring ground versus perch categories, and we used in our models the reduced ethogram (see Section 3 for details).

| Accelerometer-derived metrics
We considered 33 accelerometer-derived metrics as input into our model (Brown et al., 2013; Nathan et al., 2012; Shamoun-Baranes et al., 2012). Of these, 24 metrics were calculated using raw data from each of the three axes collected over a single segment of behavior that was defined by the change-point model. On each axis, eight metrics were considered; these were the mean, standard deviation, median, minimum, maximum, range, skewness, and kurtosis.
The remaining nine metrics were derived. These included the static and dynamic acceleration measured on each of the three axes (six metrics), and the pitch, overall dynamic body acceleration, and wing beat frequency (details of calculation methods for each of these are described in Patterson et al., 2018). All derived metrics were first calculated on the raw data and then averaged for each segment.
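The 24 per-axis summary statistics can be illustrated with a short Python/NumPy sketch (a stand-in for the authors' R workflow; the function name is ours, and the nine derived metrics such as static/dynamic acceleration, ODBA, and wing beat frequency are omitted here):

```python
import numpy as np
from scipy.stats import kurtosis, skew

def segment_features(seg, axis_names=("X", "Y", "Z")):
    """Compute the eight per-axis summary statistics for one segment
    of tri-axial accelerometry (seg has shape n_samples x 3)."""
    feats = {}
    for i, name in enumerate(axis_names):
        a = seg[:, i]
        feats.update({
            f"mean_{name}": a.mean(),
            f"sd_{name}": a.std(ddof=1),
            f"median_{name}": np.median(a),
            f"min_{name}": a.min(),
            f"max_{name}": a.max(),
            f"range_{name}": a.max() - a.min(),
            f"skew_{name}": skew(a),
            f"kurtosis_{name}": kurtosis(a),
        })
    return feats

# A fabricated 200-sample segment of tri-axial data
rng = np.random.default_rng(1)
seg = rng.normal(size=(200, 3))
feats = segment_features(seg)  # 8 statistics x 3 axes = 24 features
```

Each segment defined by the change-point model would yield one such row of features for the classifiers.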

| Data organization
We used the package "caret" in R (v. 6.0-86; Kuhn et al., 2020) to implement the model-building and evaluation processes. Details of the functions that can be used for pre-processing, data splitting, model tuning, and estimating model accuracy are provided in the resources for this package (Kuhn, 2019). Here, we review the subset of steps that we followed for this study; as such, these may provide a convenient guide to others who wish to build predictive machine learning models using accelerometer data. The steps we followed within "caret" are as follows:

1. Pre-processing: We began with "feature selection," to reduce the set of accelerometer-derived metrics used as input into our models (Nathan et al., 2012; Patterson et al., 2018). Feature selection improves the performance of machine learning algorithms and reduces computational time (Kuhn, 2008, 2012). We used the function "findCorrelation" within "caret" to create a correlation matrix and to find and remove highly correlated accelerometer-derived metrics. The function considers the absolute values of pairwise correlations and, when two variables are highly correlated, removes the variable with the largest mean absolute correlation. We used a pairwise absolute correlation cut-off of 0.75 to identify "highly" correlated variables (Kuhn & Johnson, 2013).
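caret's findCorrelation is an R function; a rough Python analogue of its greedy removal rule, applied to a small fabricated feature table (column names and data are hypothetical), might look like this:

```python
import numpy as np
import pandas as pd

def find_correlation(df, cutoff=0.75):
    """Loose analogue of caret's findCorrelation: for each pair of
    columns whose absolute correlation exceeds `cutoff`, drop the one
    with the larger mean absolute correlation with all other columns."""
    corr = df.corr().abs()
    mean_corr = corr.mean()
    cols = list(corr.columns)
    drop = set()
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            a, b = cols[i], cols[j]
            if a in drop or b in drop:
                continue
            if corr.loc[a, b] > cutoff:
                drop.add(a if mean_corr[a] > mean_corr[b] else b)
    return sorted(drop)

rng = np.random.default_rng(2)
x = rng.normal(size=500)
df = pd.DataFrame({
    "mean_X": x,
    "range_X": 2 * x + rng.normal(scale=0.01, size=500),  # near-duplicate of mean_X
    "wbf": rng.normal(size=500),                          # independent feature
})
to_drop = find_correlation(df, cutoff=0.75)  # flags one of the correlated pair
```

The real caret implementation differs in details (e.g., an "exact" re-evaluation mode), but the greedy rule above captures the behavior described in the text.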

2. Data splitting: We used 70% of our data for training the model (hereafter, the "training data") and the remaining 30% to test classification accuracy (the "testing data"). We used the function "createDataPartition" within "caret" to generate a balanced 70/30 split of the data (i.e., balanced meaning that the distribution of behavior types within the testing and training datasets mimicked that in the overall dataset; Nathan et al., 2012).
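In a Python workflow, scikit-learn's train_test_split with the stratify argument plays the role of createDataPartition; a sketch with fabricated, condor-like unbalanced labels:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
X = rng.normal(size=(1000, 8))
# Heavily unbalanced labels, as in the condor data (mostly "sitting")
y = rng.choice(["sitting", "feeding", "walking", "flying"],
               size=1000, p=[0.90, 0.05, 0.03, 0.02])

# stratify=y keeps class proportions similar in both partitions,
# mirroring caret's balanced createDataPartition split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42)
```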
3. Subsampling for class imbalance: Our data were highly unbalanced, with large disparities in the frequencies of observed behaviors. To correct this, we applied the synthetic minority oversampling technique (SMOTE) from the package "DMwR" to the training data (function "SMOTE"; v. 0.4.1; Torgo, 2011).
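The core SMOTE idea (interpolating between a minority-class sample and one of its nearest minority-class neighbors) can be sketched as follows. This is our own minimal illustration; the "DMwR" implementation differs in detail (e.g., it also undersamples the majority class):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote(X_min, n_new, k=5, seed=0):
    """Generate `n_new` synthetic minority samples by interpolating
    between each sample and one of its k nearest minority neighbors."""
    rng = np.random.default_rng(seed)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, idx = nn.kneighbors(X_min)  # idx[:, 0] is the point itself
    new = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        j = idx[i, rng.integers(1, k + 1)]  # a randomly chosen true neighbor
        gap = rng.random()                  # interpolation fraction in [0, 1)
        new.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(new)

# Fabricated feature rows for a rare behavior (e.g., flying segments)
rng = np.random.default_rng(4)
X_fly = rng.normal(loc=2.0, size=(30, 8))
synthetic = smote(X_fly, n_new=60)  # doubles-plus the minority class
```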

4. Centering and scaling: We centered and scaled the variables of the training dataset sequentially using the functions "preProcess" and "predict.preProcess" within "caret". The first of these determines which continuous variables need to be centered and scaled, and the second centers and scales those variables. This process can either be done as a stand-alone step or incorporated into the subsequent training step.
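The equivalent two-step pattern in scikit-learn fits the transform on the training data only and then applies it unchanged to the testing data (a sketch with fabricated data):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(5)
X_train = rng.normal(loc=3.0, scale=2.0, size=(700, 8))
X_test = rng.normal(loc=3.0, scale=2.0, size=(300, 8))

# Fit centering/scaling parameters on the training data only
# (the analogue of preProcess), then apply the same transform to
# both partitions (the analogue of predict.preProcess)
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)
```

Fitting on the training partition alone avoids leaking information from the testing data into the model.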

5. Cross-validation and tuning: The training dataset can also be used to evaluate the performance of the algorithms via cross-validation and to tune the parameters of the algorithms. Although most algorithms have default parameter settings, performance can be improved for each individual study by changing parameter values. Tuning is described below for each model type. We implemented models within the package "caret" and compared them using their resampling distributions (Eugster & Leisch, 2011; Hothorn et al., 2005). We used the "train" function in "caret" to automate the process of model training, parameter tuning, and model selection based on optimal values of these parameters.
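The resampling scheme used below (10-fold cross-validation repeated 3 times) has a direct scikit-learn analogue; a sketch on fabricated, well-separated two-class data (the classifier choice here is arbitrary):

```python
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(6)
X = np.vstack([rng.normal(0.0, 1.0, size=(150, 5)),
               rng.normal(2.0, 1.0, size=(150, 5))])
y = np.array([0] * 150 + [1] * 150)

# The analogue of trainControl(method = "repeatedcv", number = 10,
# repeats = 3): each model is scored on 10 x 3 = 30 resamples,
# giving a resampling distribution that models can be compared on
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=0)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=cv)
```

Comparing the mean and spread of `scores` across candidate models is the analogue of comparing caret's resampling distributions.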

| Supervised and semi-supervised ML models
All models were fit on the same training data and using the same resampling profiles. We used the argument "trainControl" to specify the type of resampling (trainControl(method = "repeatedcv", number = 10, repeats = 3)) and the argument "allowParallel = TRUE" to allow parallel processing and reduce computational time. This approach produces resampling results across tuning parameters (indicated by accuracy and kappa statistics). We also tested the performance of the models on the testing data.

TABLE 1 Ethogram of behaviors used to annotate video of behaviors of California condors. The initial ethogram included eight types of behavior: sitting, walking, drinking, feeding, sunbathing, social interaction, preening, and flying. The final category was the ethogram used in the analysis. Some behaviors occur only on the ground, others on the ground or a perch. The number of segments in total, as well as in the training and testing datasets, and the total time annotated into each behavior category are also reported (see Section 2.4 for more details). Finally, a short description of each behavior type is provided.
We also customized the tuning process for a subset of our supervised models using the argument "tuneLength" with a value of 10, meaning that the performance of a model was evaluated at 10 different values of each tuning parameter. Because this is a computationally intensive task, we customized the tuning process only for the top two models and evaluated their performance on the testing data.
The one semi-supervised model whose performance we evaluated was the nearest mean classifier from the package "RSSL" (v. 0.9.5; Krijthe, 2016; Krijthe & Loog, 2015). Because the algorithm works by iteratively labeling the unlabeled behaviors and adding these predictions to the set of labeled behaviors until the classifier converges, we first randomly removed 70% of the labels from the training dataset using the function "add_missinglabels_mar". We then used the function "SelfTraining" to predict the labels of the unlabeled data in the training dataset. Of these predicted class labels, those with the highest probability of being correct were adopted as "pseudo-labels," and the algorithm was iteratively improved. The data with labels and pseudo-labels were then used to train the final algorithm.
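The self-training loop can be illustrated with a toy Python version built around scikit-learn's nearest-centroid (nearest mean) classifier. Everything here is our own sketch: RSSL's SelfTraining uses classifier-specific confidence scores, whereas this version promotes the unlabeled points closest to a centroid:

```python
import numpy as np
from sklearn.neighbors import NearestCentroid

def self_train_nearest_mean(X, y, n_iter=10, batch=20):
    """Toy self-training loop: repeatedly fit a nearest-mean classifier
    on the labeled pool, then promote the unlabeled points nearest to a
    centroid to pseudo-labels. y == -1 marks an unlabeled sample."""
    y = y.copy()
    for _ in range(n_iter):
        labeled = y != -1
        if labeled.all():
            break
        clf = NearestCentroid().fit(X[labeled], y[labeled])
        unlabeled_idx = np.flatnonzero(~labeled)
        # Confidence proxy: distance to the nearest class centroid
        d = np.min([np.linalg.norm(X[unlabeled_idx] - c, axis=1)
                    for c in clf.centroids_], axis=0)
        promote = unlabeled_idx[np.argsort(d)[:batch]]
        y[promote] = clf.predict(X[promote])
    return NearestCentroid().fit(X[y != -1], y[y != -1])

# Fabricated two-class data; remove 70% of labels, as in the study
rng = np.random.default_rng(7)
X = np.vstack([rng.normal(0.0, 0.5, size=(100, 4)),
               rng.normal(3.0, 0.5, size=(100, 4))])
y_true = np.array([0] * 100 + [1] * 100)
y = y_true.copy()
y[rng.random(200) < 0.7] = -1  # -1 = label removed

model = self_train_nearest_mean(X, y)
acc = (model.predict(X) == y_true).mean()
```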
Finally, we used our best-performing models of each model type, with model-specific tuning parameters, to predict behavior types in the testing data (function "predict" in "caret"). We then used the function "confusionMatrix" to construct a confusion matrix of actual and predicted behavior to evaluate the performance of the algorithms.
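Because our data were heavily dominated by sitting, overall accuracy and the kappa statistic can diverge sharply, which is why we report both. A small fabricated confusion-matrix example shows why (all labels and counts here are hypothetical, not our condor results):

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score, confusion_matrix

# Hypothetical actual vs. predicted behavior labels for 100 test segments
actual = np.array(["sitting"] * 90 + ["flying"] * 5 + ["walking"] * 5)
predicted = np.array(["sitting"] * 88 + ["walking"] * 2 +   # sitting segments
                     ["flying"] * 4 + ["sitting"] * 1 +     # flying segments
                     ["walking"] * 3 + ["sitting"] * 2)     # walking segments

labels = ["flying", "sitting", "walking"]
cm = confusion_matrix(actual, predicted, labels=labels)
accuracy = np.trace(cm) / cm.sum()              # fraction on the diagonal
kappa = cohen_kappa_score(actual, predicted)    # chance-corrected agreement
```

Here accuracy is 0.95 simply because 90% of segments are sitting, while kappa (about 0.72) discounts the agreement expected by chance; with a trivial always-predict-sitting classifier, accuracy would stay at 0.90 but kappa would drop to 0.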

| Unsupervised ML approaches
We evaluated the performance of two unsupervised ML approaches, K-means clustering (Sakamoto et al., 2009) and EM clustering ("expectation-maximization"; Chimienti et al., 2016). We first implemented these models on the training data, setting the model to identify an optimal number of clusters. We used the function "fviz_nbclust" to determine and visualize the optimal number of clusters using within-cluster sums of squares (package "factoextra"; Kassambara & Mundt, 2017; v 1.0.7). We then evaluated descriptive statistics for each cluster to identify the behavior they represented.
Finally, we used the observed clusters to classify the testing dataset.
We implemented K-means using the function "KMeans_rcpp" from the package ClusterR (v. 1.2.4; Mouselimis, 2021). We ran the algorithm on the training data with five initializations and a maximum of 10 iterations (Mouselimis, 2021). The computational time of the algorithm can be adjusted using the number of initializations and iterations (num_init, max_iters; Mouselimis, 2021). The centroids identified from the training dataset were then used to cluster the testing dataset using the function "predict_KMeans." Similarly, we used the function "MclustDA" from the package "mclust" (v. 5.4.7; Fraley & Raftery, 2002) to cluster the training data using the EM algorithm. We then used the "predict" function from the same package to cluster the testing dataset.
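The fit-on-training, predict-on-testing pattern for both unsupervised approaches has close scikit-learn analogues (KMeans for KMeans_rcpp/predict_KMeans, GaussianMixture for the EM clustering in mclust); a sketch on fabricated two-cluster data:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(8)
X_train = np.vstack([rng.normal(0.0, 0.3, size=(200, 3)),
                     rng.normal(2.0, 0.3, size=(200, 3))])
X_test = np.vstack([rng.normal(0.0, 0.3, size=(50, 3)),
                    rng.normal(2.0, 0.3, size=(50, 3))])

# K-means: fit centroids on the training data with 5 initializations
# and at most 10 iterations (mirroring num_init and max_iters), then
# assign test segments to the nearest centroid
km = KMeans(n_clusters=2, n_init=5, max_iter=10, random_state=0).fit(X_train)
km_test = km.predict(X_test)

# EM: fit a Gaussian mixture on the training data, then cluster the
# test data by maximum posterior probability
gm = GaussianMixture(n_components=2, random_state=0).fit(X_train)
gm_test = gm.predict(X_test)
```

Descriptive statistics per cluster (as in our workflow) would then be used to decide which behavior each cluster most resembles.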

| RESULTS
The full dataset we considered included 27.4 h of paired video and accelerometry data, ranging from 0.6 to 5.9 h per condor (median = 3.08 h). This included 152,947 behavioral segments, of which <1% were of birds flying, <1% were of walking, 2.2% were of drinking or feeding, and 96.9% were of sitting (Table 1). After the initial evaluation of model performance, we verified that our ML models had difficulty distinguishing sitting from social interaction, sunbathing, and preening. We therefore grouped all four of these behaviors into a single "sitting" category. Likewise, models were unable to distinguish drinking from feeding, so we grouped these two behaviors together. This process resulted in a reduced ethogram with four behavioral classes ("Walking," "Drinking/Feeding," "Sitting," and "Flying") that were input into classification models. We detected substantial correlation among accelerometer-derived metrics (Figure 2) and ultimately removed 25 variables from consideration. The eight we retained were the median of the Y axis, standard deviation of the Z axis, mean of pitch, static acceleration on the Z axis, dynamic acceleration on the X axis, dynamic acceleration on the Y axis, dynamic acceleration on the Z axis, and wing beat frequency (WBF). The distribution of these metrics among the four behaviors we considered is given in SI1 in Data S1.

| Comparison of ML modeling approaches
Mean classification accuracy during training of the six supervised ML models we considered ranged from 0.59 to 0.91 (Table 2, SI2 in Data S1). Of these, random forest (RF) and k-nearest neighbor (kNN) performed the best, each with overall predictive accuracies >0.81. Kappa statistics were also highest for RF and kNN, in most cases substantially greater than for other modeling approaches.
Linear discriminant models were the worst performers of the models we considered, and neural network, support vector machine, and classification and regression tree were intermediate performers.
Interestingly, when we applied these models to the testing data, the balanced accuracy when predicting all four behaviors was highest for the neural network (SI3 and SI4 in Data S1). Sensitivity and specificity of algorithms were variable, although generally reasonable for flying and sitting behaviors, and poor for drinking/feeding and walking (SI3 in Data S1).
We customized the tuning parameters for the two supervised models with the best performance (random forest and k-nearest neighbor). For RF, we evaluated model performance using values of mtry ranging from 1 to 10 (mtry is the number of variables randomly sampled at each split in the random forest algorithm). Accuracy of the algorithm was best with an mtry value of 5.
Likewise, for kNN, we evaluated model performance using values of k ranging from 5 to 25 (k is the number of neighbors the algorithm considers when classifying a specific query point). Accuracy of the algorithm was best when the value of k was 7. Customizing the tuning process for these models did not change their performance, whether measured overall or by any one of the eight parameters we considered, including balanced accuracy (Table 3).

FIGURE 2 Pairwise correlation of 33 accelerometer-derived metrics that were used as input into our model. Of these, 24 metrics were calculated using raw data from each of the three axes (X, Y, and Z) collected over a single change-point model-defined segment of behavior. These 24 metrics were eight statistics (the mean, standard deviation, median, minimum, maximum, range, skewness, and kurtosis), each calculated on each of the three axes. The remaining nine metrics were derived from these data and included the static and dynamic acceleration measured on each of the three axes, as well as the pitch, overall dynamic body acceleration (ODBA), and wing beat frequency (WBF). The figure shows the sign and magnitude of the correlation value, using two colored hues, where the intensity of color increases uniformly as the correlation value moves away from 0. Color (blue for positive values, red for negative values) signifies the sign of the correlation, while the fill area is proportional to the absolute value of the correlation.

TABLE 2 Model accuracy parameters for six supervised classification models of accelerometer data collected from California condors. Note: Models were ranked by two parameters, mean accuracy and mean Kappa. Model types are random forest (RF), k-nearest neighbor (kNN), neural network (NN), support vector machine (SVM), classification and regression tree (CART), and linear discriminant analysis (LDA).
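The two tuning parameters above have direct scikit-learn analogues (mtry corresponds to max_features in RandomForestClassifier; k corresponds to n_neighbors in KNeighborsClassifier). A hedged sketch of the equivalent grid searches on fabricated data, not a reproduction of our condor analysis:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(9)
X = np.vstack([rng.normal(0.0, 1.0, size=(100, 8)),
               rng.normal(1.5, 1.0, size=(100, 8))])
y = np.array([0] * 100 + [1] * 100)

# mtry in R's random forest ~ max_features: variables sampled per split
rf_search = GridSearchCV(
    RandomForestClassifier(n_estimators=50, random_state=0),
    {"max_features": list(range(1, 9))}, cv=5).fit(X, y)

# k in kNN ~ n_neighbors: neighbors consulted per query point
knn_search = GridSearchCV(
    KNeighborsClassifier(),
    {"n_neighbors": list(range(5, 26, 2))}, cv=5).fit(X, y)
```

The best values found on any particular dataset (5 and 7 in our analysis) will of course differ for other data.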
The semi-supervised nearest mean classifier showed similar patterns in performance as did the supervised algorithms ( Table 4). The overall accuracy of the algorithm was 0.61, and the balanced accuracy was highest for sitting and flying behaviors (0.65 in both cases), but poor for other behavior classes. Model classification parameters were generally worse for semi-supervised nearest mean classifier than they were for RF and kNN models.
The optimal number of clusters for unsupervised modeling was four (SI5 in Data S1), and K-means and EM clustering had overall classification accuracies of 0.61 and 0.77, respectively. However, the kappa statistics for both models were extremely poor (−0.02 and 0.06, respectively). Both unsupervised classification models, K-means and EM clustering, resulted in class-specific balanced classification accuracies that generally were similar to or worse than those of either the RF or kNN models (Table 5). That said, it is notable that these models performed reasonably well at predicting the two less prevalent behavior classes, drinking/feeding and walking.

| DISCUSSION
Here, we have demonstrated the importance and value of supervised classification approaches to the analysis of accelerometry data. Unsupervised approaches to the classification of behavior, although commonly used (Bishop et al., 2015; Chimienti et al., 2016; Sakamoto et al., 2009), provided some unique information, but overall they lacked the accuracy, and thus the value for inference, of supervised approaches. Furthermore, although several recent studies have compared different types of supervised classification methods (Rast et al., 2020; Tatler et al., 2018), our study is one of few (e.g., Patterson et al., 2018) that specifically compares unsupervised, semi-supervised, and supervised approaches to classification. As such, this study provides important insight not only into the relative value of these different approaches, but also into when it may be appropriate to use each type of approach.

| Comparison of ML modeling approaches
There were distinct differences in performance not only among supervised, semi-supervised, and unsupervised approaches, but also among the six supervised modeling approaches we evaluated. Our analysis provides no support for the continued use of unsupervised ML approaches for the classification of a priori-determined behavioral classes, and little support for the semi-supervised nearest mean classification approach. Furthermore, among the supervised approaches, RF and kNN performed best (as indicated by kappa statistics), suggesting that these two approaches may be best suited to future classification problems. The existing literature is fairly consistent with these findings (e.g., Tatler et al., 2018).

TABLE 3 Model accuracy parameters for two different approaches (random forest (RF) and k-nearest neighbor (kNN)) to supervised classification of accelerometer data collected from California condors.

| Inference from unsupervised classification of bio-logging data
There is a growing trend in the classification literature calling for, or using, either supervised classification or post hoc calibration of behavior when conducting analyses of accelerometry or other bio-logging data (Elbroch et al., 2018; Garrod et al., 2021; Halsey, 2017; Halsey & Bryce, 2021; Rast et al., 2020; Van Walsum et al., 2020).
The body of literature in this field is growing more robust as there is an increase in the diversity of species evaluated and analytical techniques considered. Furthermore, these analyses show that if the goal is interpreting overall energy expenditure, specific behaviors that are defined a priori, or specific aspects of behavior (e.g., characteristics of wing beats or strides), unsupervised models are a poor choice (although, as we discuss below, there is a role for these approaches).
As one of the first papers whose explicit goal was to compare supervised versus other approaches to classification, our study is consistent with others that call into question some uses of unsupervised classification of bio-logging data. It is true that gathering validation data, and validating or calibrating classifications, can involve substantial difficulty. Despite these challenges, it is increasingly recognized that this effort is essential for inference drawn from matching a priori-defined behaviors to accelerometry data.
Despite the weaknesses of unsupervised classification approaches, there are situations where using this toolkit is appropriate.
EM clustering in particular performed reasonably well at identifying rarely encountered behaviors. Likewise, although unsupervised classification is a poor choice for the identification of a priori-defined behaviors, it can effectively identify patterns or clusters in accelerometer signals, and these can be interpreted as "behavioral modes." As such, it is reasonable to use post hoc data interpretation to characterize predominant behaviors within those modes. For example, K-means clustering has been used to identify behavioral "groups" from accelerometer data collected from European shags (Phalacrocorax aristotelis) in Scotland, UK (Sakamoto et al., 2009), and "movement states" from short-interval GPS data collected from bald eagles in the midwestern USA (Bergen et al., 2022). In situations such as these, the use of unsupervised classification is appropriate, since inference is not to specific behaviors, but to "modes" or "states" in which accelerometer data cluster together.

| CONCLUSIONS
Our work, together with that from earlier studies, suggests several best practices for the use of accelerometry data to draw inference about animal behavior. These are:

1. Unsupervised classification techniques are not appropriate for identifying a priori-defined behaviors in telemetry data. These tools are reasonable for information gathering and for inference to post hoc definitions of more generalized behavioral modes or states.
2. Those wishing to use ML tools to identify behavioral modes or states from accelerometry data may wish to evaluate several ML classification algorithms to identify which works best with the unique features of their data. This is likely especially important when behavioral datasets are unbalanced, as ours was and as most biologically realistic datasets are.
TABLE 5 Model accuracy parameters for two approaches to unsupervised classification (K-means clustering and expectation-maximization (EM) clustering) of accelerometer data collected from California condors. Note: Parameters are defined in the S2 supporting information of Sur et al. (2017).