An active learning framework and assessment of inter-annotator agreement facilitate automated recogniser development for vocalisations of a rare species, the southern black-throated finch (Poephila cincta cincta)

The application of machine learning methods has led to major advances in the development of automated recognisers used to analyse bioacoustics data. To further improve the performance of automated call recognisers, we investigated the development of efficient data annotation strategies and how best to address uncertainty around ambiguous vocalisations. These challenges present a particular problem for species whose vocalisations are rare in field recordings, where collecting enough training data can be problematic and a species’ vocalisations may be poorly documented. We provide an open access solution to address these challenges using two strategies. First, we applied an active learning framework to iteratively improve a convolutional neural network (CNN) model able to automate call identification for a target rare bird species, the southern black-throated finch (Poephila cincta cincta). We collected 9,098 hours of unlabelled audio recordings from a field study in the Desert Uplands Bioregion of Queensland, Australia, and used active learning to prioritise human annotation effort towards data that would best improve model fit. Second, we advanced methods for managing ambiguous vocalisations by applying machine learning methods more commonly used in medical image analysis and natural language processing. Specifically, we assessed agreement among human annotators and the CNN model (i.e. inter-annotator agreement) and used this to determine realistic performance outcomes for the CNN model and to identify areas where inter-annotator agreement may be improved. We also applied a classification approach that allowed the CNN model to classify sounds into an ‘uncertain’ category, which replicated a requirement of human annotation and facilitated comparison of human and model annotation performance. We found that active learning was an efficient strategy to build a CNN model where limited labelled training data were available and target calls were extremely rare in the unlabelled data. As few as five active learning iterations, generating a final labelled dataset of 1,073 target calls and 5,786 non-target sounds, were required to train a model to identify the target species with comparable performance to experts in the field.

Label uncertainty is what Cabitza et al. (2020) and Campagner et al. (2021) describe as the 'elephant in the machine' in the context of medical ML applications. Cabitza et al. (2020) also argue that considering uncertainty in the annotation process is important to develop models that generalise to real-world scenarios.
Here we present a novel, open access method to develop a call detection model for a threatened bird species, the southern black-throated finch (SBTF), Poephila cincta cincta, that lacks an existing labelled call dataset. Our method applies an active learning framework to prioritise scarce human annotation resources towards the data that are most likely to improve the model. Our work builds on the 'standard recipe' for bioacoustic classification described by Stowell (2022) and applies an active learning framework to iteratively improve the model. We also apply a novel method to incorporate inter-annotator agreement and label uncertainty, both rarely considered in bioacoustics, into the development and evaluation of call detection models. This work has broad application for many species that have limited existing training data available and/or for species whose calls may not be annotated with certainty.
The SBTF is listed as 'Endangered' under the Australian Environment Protection and Biodiversity Conservation Act 1999.
Figure 1. Spatial distribution of acoustic recording units within the Study Area. Filled and hollow symbols represent recordings used for training and testing data, respectively. Recorders were placed within remnant woodlands within the Study Area, focussing on areas that comprised broadly suitable habitat for the southern black-throated finch.

Data pre-processing
We split audio data into 1.8 second audio frames, which is approximately double the maximum length of the target call (Higgins, Peter and Cowling, 2006). We used a 50% overlap between frames, i.e. a sliding window approach, which ensured target calls would be entirely included within at least one audio frame (Kahl et al., 2021). We then transformed audio frames into mel-scaled spectrogram images using a short-time Fourier transform, with a Hann window length of 2048, a 50% overlap between segments and 128 mel filter banks, following standard methods such as those applied by LeBien et al. (2020).
We chose a frequency bandwidth of 1.5 kHz to 5 kHz, which ensured the dominant frequency of the target call was included within the frame, while avoiding excess noise at lower frequencies and higher frequencies (e.g. cicadas). We found that this frequency bandwidth typically included a fundamental harmonic and/or a higher harmonic of the dominant frequency; however, these harmonics were not the focus of the bandwidth selection due to their rapid attenuation at lower sound pressure levels (Koehler et al., 2017). An example frame used as input into the CNN, following the data pre-processing steps, is depicted in Figure 2.
Figure 2. Example audio frame following data pre-processing steps that was used as input into the convolutional neural network. The audio frame shows a SBTF call with a dominant frequency (stronger signal) and a fundamental harmonic (weaker signal) included in the frame.
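As an illustration of this pipeline, the following is a minimal sketch using librosa; the paper does not name its audio library, and the sample rate and file name here are assumptions:

import librosa
import numpy as np

SR = 22050                 # sample rate: an assumption, not stated in the paper
FRAME_S = 1.8              # frame length in seconds (from the paper)
HOP_S = 0.9                # 50% overlap between frames

y, sr = librosa.load("recording.wav", sr=SR)  # hypothetical file name
frame_len, hop_len = int(FRAME_S * sr), int(HOP_S * sr)

frames = []
for start in range(0, len(y) - frame_len + 1, hop_len):
    clip = y[start:start + frame_len]
    # Hann window of 2048 samples, 50% overlap between STFT segments,
    # 128 mel filter banks, band-limited to 1.5-5 kHz as described above.
    mel = librosa.feature.melspectrogram(
        y=clip, sr=sr, n_fft=2048, hop_length=1024, window="hann",
        n_mels=128, fmin=1500, fmax=5000)
    # dB scaling is a common convention, not specified in the paper
    frames.append(librosa.power_to_db(mel, ref=np.max))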

CNN architecture
Application of CNNs to audio recognition tasks is well established, with CNNs forming part of the 'standard recipe' for bioacoustic classification (LeBien et al., 2020; Allen et al., 2021; Stowell, 2022). We used a data pre-processing pipeline and CNN architecture that have been widely applied for bioacoustic classification tasks (Christin, Hervet and Lecomte, 2019; Stowell et al., 2019; Stowell, 2022).
We implemented the CNN in Python (version 3.6.9, Python Foundation) within Google Colaboratory, using PyTorch (version 1.10), an open source ML library (Paszke et al., 2019). We used a ResNet-34 model with pre-trained weights for the CNN architecture.
ResNet models typically achieve high performance in image and audio recognition tasks (He et al., 2016; Stowell et al., 2019; Bergler et al., 2022) and have been widely applied for automated wildlife image and call recognition (Sankupellay and Konovalov, 2018; Kahl et al., 2021; Stowell, 2022). Our training dataset was imbalanced, with fewer audio frames containing a SBTF call than frames without one. We followed the recommendations of Buda, Maki and Mazurowski (2018) and oversampled the SBTF audio frames with the WeightedRandomSampler function in PyTorch. We used an Adam optimiser algorithm with an exponential learning rate decay function, which is a common method of learning rate optimisation used for CNN training (Kingma and Ba, 2014). We then applied a sigmoid activation to the output layer of the CNN to generate predictions between 0 and 1.
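A minimal PyTorch sketch of this setup is given below. The decay rate (gamma), the dataset objects and the stacking of single-channel spectrograms to three channels are our assumptions, and the sigmoid is folded into BCEWithLogitsLoss during training, a standard equivalent to a sigmoid output layer:

import torch
import torch.nn as nn
from torchvision import models
from torch.utils.data import DataLoader, WeightedRandomSampler

# `train_ds` is assumed to yield (spectrogram, label) pairs, with spectrograms
# stacked to three channels to match ResNet input; `labels` is its 0/1 labels.
model = models.resnet34(pretrained=True)         # pre-trained ResNet-34
model.fc = nn.Linear(model.fc.in_features, 1)    # single output: SBTF vs not

# Oversample the rare SBTF class (Buda, Maki and Mazurowski, 2018).
lab = torch.tensor(labels)
weights = 1.0 / torch.bincount(lab).float()[lab]
sampler = WeightedRandomSampler(weights, num_samples=len(weights))
loader = DataLoader(train_ds, batch_size=64, sampler=sampler)

optimiser = torch.optim.Adam(model.parameters(), lr=0.001)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimiser, gamma=0.9)  # gamma assumed
criterion = nn.BCEWithLogitsLoss()  # sigmoid applied at prediction time

for epoch in range(10):
    for x, y in loader:
        optimiser.zero_grad()
        loss = criterion(model(x).squeeze(1), y.float())
        loss.backward()
        optimiser.step()
    scheduler.step()  # exponential learning rate decay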
Active learning iterations used a batch size of 64, ten epochs and a learning rate of 0.001. We used a grid search technique (Mohri, Rostamizadeh and Talwalkar, 2018) to tune the hyperparameters of the final model, including the number of epochs, batch size and learning rate; a sketch of such a search follows.
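A grid search simply evaluates every combination of candidate hyperparameter values against the validation metric. The grid values and the train_and_validate helper below are hypothetical:

from itertools import product

grid = {"lr": [1e-4, 1e-3, 1e-2], "batch_size": [32, 64, 128], "epochs": [5, 10, 20]}
best = None
for lr, bs, n_epochs in product(grid["lr"], grid["batch_size"], grid["epochs"]):
    # train_and_validate is a hypothetical helper returning the validation F1
    f1 = train_and_validate(lr=lr, batch_size=bs, epochs=n_epochs)
    if best is None or f1 > best[0]:
        best = (f1, lr, bs, n_epochs)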

Active learning framework
We applied an active learning approach to iteratively train and improve the CNN model. The active learning approach is depicted within Figure 3 and described below. We created an initial labelled dataset from candidate calls identified using hidden Markov model clustering in the Kaleidoscope Pro software (Wildlife Acoustics, 2019). We augmented the initial training data by creating triplicate versions of each target call shifted to random horizontal (time domain) positions, which is a common technique to artificially increase the size of a training set (Stowell, 2022). An initial model was trained from this dataset.
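One plausible implementation of this augmentation, circularly shifting each clip to a random time-domain position with np.roll (the shift strategy and names are illustrative), is:

import numpy as np

rng = np.random.default_rng(42)

def time_shift_triplicates(clip):
    # Return three copies of a target-call clip, each circularly shifted
    # to a random position in the time domain (one plausible strategy).
    return [np.roll(clip, rng.integers(0, len(clip))) for _ in range(3)]

# `target_clips` is an assumed list of 1-D audio arrays containing target calls
augmented = [aug for clip in target_clips for aug in time_shift_triplicates(clip)]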
Following the creation of an initial labelled dataset we developed the model using a series of active learning iterations. For each iteration we used the model to make predictions on a new set of unlabelled data, which comprised audio data from one solar-powered bioacoustic recorder (between 833 and 1,402 hours) and one Audiomoth recorder (between 80 and 133 hours). Model predictions were in the form of a logit, on a scale of 0 to 1 (Monarch, 2021), where 0 represented the lowest probability of being the target call and 1 represented the greatest probability. Human annotators then labelled all predictions with a logit greater than 0.5. The 0.5 logit cut-off was selected to prioritise the annotation of data likely to include target calls, which are rare within the unlabelled data, while also annotating the signals that the model identified with the least certainty and from which it would gain the greatest amount of new information, i.e. those around the 0.5 logit (Roh, Heo and Whang, 2019; Monarch, 2021). Through this active-learning-driven annotation process we grew the training data to 1,073 audio frames containing target calls and 5,786 non-target audio frames within five iterations (Table 1). All labels were reviewed by a primary annotator (J.V.O.) with 5 years' experience with SBTF prior to the next iteration of model training.
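The selection step of one iteration might look as follows; this is a sketch, with unlabelled_loader assumed to yield frame indices alongside spectrogram batches:

import torch

model.eval()
to_annotate = []
with torch.no_grad():
    for idx, x in unlabelled_loader:
        scores = torch.sigmoid(model(x)).squeeze(1)  # predictions on a 0-1 scale
        to_annotate.extend(idx[scores > 0.5].tolist())  # queue for human labelling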
Each successive model iteration was trained on 70% of the training data and validated against 30% of the training data. The train/validation split approach minimises potential overfitting of the model on small sample sizes, when compared to a cross-validation approach (Vabalas et al., 2019). We chose the F1 score as our validation metric, which is a standard performance metric that includes information on both model precision and recall (Mohri, Rostamizadeh and Talwalkar, 2018; Stowell, 2022). Iterative model training stopped when the stopping criterion was reached, which was when successive model iterations did not improve the F1 score. A final model was then trained using the combined training and validation data. A separate test dataset was used for final testing/evaluation of the model, which is described in Section 2.4.1.
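For reference, the F1 score is the harmonic mean of precision (P) and recall (R):

F_1 = \frac{2 \cdot P \cdot R}{P + R}, \qquad P = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}}, \quad R = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}}

where TP, FP and FN are the counts of true positive, false positive and false negative audio frames, respectively.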
Evaluation of models trained on data labelled through an active learning framework requires evaluation designs beyond the standard method of testing the model on labelled data that are reserved from the initial labelled dataset, i.e. the 'hold-out test set' (Mohri, Rostamizadeh and Talwalkar, 2018; Stowell, 2022). This is because all labelled data available were identified using either hidden Markov model clustering in the Kaleidoscope Pro software (Wildlife Acoustics, 2019), or through the active learning process, which biases data to the model and may not be representative of the broader unlabelled data (Settles and Craven, 2008; Roh, Heo and Whang, 2019).
Therefore, to evaluate the final model, we used an unlabelled dataset that contained 2,735 hours of audio recorded from four sites with solar-powered bioacoustic recorders and five sites with Audiomoth recorders. These sites were independent of those used for model training. The final model was run over the entire test dataset to predict whether each audio frame (1.8 seconds in duration, with a 50% overlap between frames) contained a target call, resulting in c. 10.9 million predictions. We then created the test dataset using a stratified random sampling approach: we selected a random sample of 500 audio frames from each 0.01 increment of prediction scores, which were in the form of logits on the scale of 0 to 1 (refer to Section 2.3.2). Where a 0.01 logit increment contained fewer than 500 audio frames, we included all audio frames within that increment. In total, the test dataset included 12,278 audio frames. The primary annotator manually labelled all audio frames in the test dataset.
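A sketch of this stratified sampling with pandas (the DataFrame and column names are assumptions) is:

import numpy as np
import pandas as pd

# `preds` is an assumed DataFrame with one row per audio frame and a
# `score` column holding the model's 0-1 prediction for that frame.
preds["bin"] = np.floor(preds["score"] * 100).clip(upper=99).astype(int)
test_set = (preds.groupby("bin", group_keys=False)
                 .apply(lambda g: g.sample(n=min(len(g), 500), random_state=1)))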

Managing uncertain labels
Automated call recognition relies on the accurate identification of calls within the sample data. As noted previously, even highly trained experts can have difficulty achieving high annotation reliability from real-world recordings. In our study the accuracy of annotations is limited by factors such as similar vocalisations of co-occurring species, attenuation of the target calls at distance and overlap with other environmental noise.
Experience of annotators within our study suggested that certain sounds cannot be definitively labelled as either 'SBTF' or 'Not SBTF'. These include calls that were substantially attenuated by distance from the microphone or were obscured by other sounds in the same frequency range, as well as short contact calls of SBTF, which are acoustically similar to those of other estrildid finches within the Study Area. We therefore allowed annotators to label audio frames into three categories, 'SBTF', 'Uncertain' and 'Not SBTF', following rules supplied to each annotator (provided in the Supplementary Data).
For our model, we used a 'classification with a reject option' approach, which allowed audio frames to not be classified (i.e. to be labelled as uncertain) when the model predictions were ambiguous (Bishop, 2006; Thakur et al., 2019). Classification with rejection uses threshold values that determine when the model's predictions are confident enough to be labelled as 'SBTF' or 'Not SBTF'; predictions falling between these thresholds are rejected and labelled as 'Uncertain'.
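A minimal sketch of such a decision rule (the threshold values here are illustrative, not the paper's):

def classify_with_reject(score, lower=0.2, upper=0.8):
    # Map a 0-1 model score to a three-way label; scores between the
    # rejection thresholds are not classified ('Uncertain').
    if score >= upper:
        return "SBTF"
    if score <= lower:
        return "Not SBTF"
    return "Uncertain"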

Model performance
We evaluated the performance of the final model and all preceding active learning iterations against the test dataset. We converted all labels and model predictions to an ordinal scale, with 'Not SBTF' assigned a value of 1, 'Uncertain' a value of 2, and 'SBTF' a value of 3. We evaluated model performance using the macro-averaged mean absolute error (MMAE), a standard metric for ordinal classification (Cardoso and Sousa, 2011). We chose an analysis suitable for ordinal data because the 'Uncertain' label is considered closer to both 'SBTF' and 'Not SBTF' than those two labels are to each other. We estimated confidence intervals of the MMAE using bootstrapping (n = 10,000). We conducted all analyses using the imblearn package (version 0.8) in Python.
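A sketch of this evaluation with imblearn is below; the label arrays are assumed, and a simple percentile bootstrap is shown, which may differ in detail from the paper's procedure:

import numpy as np
from imblearn.metrics import macro_averaged_mean_absolute_error

# y_true and y_pred are assumed arrays of ordinal labels:
# 1 = 'Not SBTF', 2 = 'Uncertain', 3 = 'SBTF'.
y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
mmae = macro_averaged_mean_absolute_error(y_true, y_pred)

# Bootstrap 95% confidence interval (n = 10,000 resamples).
rng = np.random.default_rng(0)
boot = []
for _ in range(10_000):
    i = rng.integers(0, len(y_true), len(y_true))
    boot.append(macro_averaged_mean_absolute_error(y_true[i], y_pred[i]))
ci = np.percentile(boot, [2.5, 97.5])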

Inter-annotator agreement
Two experts (B.D. and E.M.) independently annotated a random subset of 9.5% (n = 1,165) of the test dataset as secondary annotators. Both experts had over 10 years of experience working on the species. The annotation process was blind, with each expert not receiving annotations from the other expert or primary annotator (J.V.O.).
We assessed inter-annotator agreement using the agreement coefficient Krippendorff's alpha (α) (Krippendorff, 2011; Monarch, 2021). We then bootstrapped (n = 10,000) the distribution of each agreement coefficient and calculated 95% confidence intervals, as recommended by Hayes and Krippendorff (2007). It is generally regarded that values above 0.8 represent good agreement among annotators, while values within the range of 0.667 to 0.8 represent tolerable agreement (Reidsma and Carletta, 2008).
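A sketch using the krippendorff PyPI package (an assumed tool; the simple bootstrap over items shown here approximates rather than reproduces the Hayes and Krippendorff (2007) algorithm):

import numpy as np
import krippendorff

# `ratings` is an assumed annotators x items matrix of ordinal labels
# (1 = 'Not SBTF', 2 = 'Uncertain', 3 = 'SBTF'); np.nan marks missing labels.
alpha = krippendorff.alpha(reliability_data=ratings,
                           level_of_measurement="ordinal")

# Bootstrap over items to obtain an approximate 95% confidence interval.
rng = np.random.default_rng(0)
n_items = ratings.shape[1]
boot = [krippendorff.alpha(
            reliability_data=ratings[:, rng.integers(0, n_items, n_items)],
            level_of_measurement="ordinal")
        for _ in range(10_000)]
ci = np.percentile(boot, [2.5, 97.5])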
We assessed agreement among the final model, the primary annotator and the expert secondary annotators using pairwise combinations of α, and tested for differences among the pairwise combinations using a Kruskal-Wallis test. We tested the effect of excluding 'Uncertain' labels during model evaluation by comparing agreement coefficients calculated with 'Uncertain' labels retained and removed, using a Wilcoxon rank-sum test. We also tested for a difference between the agreement coefficients of data recorded on solar-powered bioacoustic recorders and those recorded on Audiomoth devices, again using a Wilcoxon rank-sum test. We undertook all significance tests using SciPy version 1.10.0.
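The corresponding SciPy calls, applied to the bootstrapped α distributions (all variable names are assumptions), would be along the lines of:

from scipy import stats

# Kruskal-Wallis test across the pairwise agreement distributions.
h, p_kw = stats.kruskal(alpha_model_primary, alpha_model_experts,
                        alpha_primary_experts)

# Wilcoxon rank-sum tests for the two-group comparisons.
w1, p1 = stats.ranksums(alpha_with_uncertain, alpha_without_uncertain)
w2, p2 = stats.ranksums(alpha_bioacoustic, alpha_audiomoth)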

CNN model development with active learning
The CNN model performance stabilised within five active learning iterations (Figure 4). The majority of the improvement in model performance was achieved within the first two iterations, with a 34.3% decrease in macro-averaged MAE between iterations one and three, compared to smaller changes in subsequent iterations.
Figure. Agreement among all annotators, including the primary annotator and two experts, for audio frames captured using solar-powered bioacoustic recorders (Frontier Labs) and Audiomoths (Hill et al., 2019). Error bars show 95% confidence intervals.

Discussion
Our study demonstrates that active learning is an effective strategy for building machine learning (ML) classifiers for species with limited labelled training data. Using as few as five active learning iterations, we generated a final labelled dataset of 1,073 target calls and 5,786 non-target sounds. This was sufficient to create a model with classification abilities comparable to experts familiar with the species. Active learning focusses scarce human annotation resources on the data that are most valuable for model performance (Van Engelen and Hoos, 2020), primarily records with greater uncertainty. In our study, we found that the active learning framework selected audio frames for annotation that included non-target calls similar to the target call, as well as sounds that were dissimilar to the target call due to over-fitting in early model iterations. Over-fitting occurs when the model learns noise in the training data instead of the underlying pattern of the target call (Bishop, 2006).
Annotation of these uncertain calls had a measurable impact on model performance. As such, active learning can reduce the time and effort required to develop labelled training data for building classifiers (Priyadarshani, Marsland and Castro, 2018;Stowell, 2022).
Active learning reduces the cost of developing labelled datasets, which form the foundation of successful call detection models (Van Engelen and Hoos, 2020). We estimate that 0.028% of our unlabelled data comprises the target call, making manual screening of our unlabelled data resource intensive. Manually obtaining an equivalent number of target calls to our final dataset (n = 1,073) by reviewing our unlabelled audio data would have required the review of approximately 1,916 hours of audio data. Active learning provided an approach to overcome this barrier to developing a call detection model for the target species. As many wildlife species do not have existing labelled datasets, an active learning approach has general applicability to the field of bioacoustics, particularly for species whose calls are rare in long-term field recordings (Ricci, Rokach and Shapira, 2022). Development of open access methods to improve call recognisers for these rare and difficult to survey species, many of which are also threatened (Loiseau et al., 2020), will contribute to research and monitoring that removes data deficiencies for these species and thus improves policy and conservation outcomes (Davies, Margules and Lawrence, 2004; Sekercioglu et al., 2008; Loiseau et al., 2020).
Active learning is a powerful method for bioacoustic deep learning, but it poses certain challenges that deviate from the 'standard recipe' of bioacoustic deep learning described by Stowell (2022). One challenge is that the active learning process biases labelled data to the model since the model's predictions guide the annotation process. Resulting labels therefore cannot be used in the model's evaluation (Roh, Heo and Whang, 2019; Ricci, Rokach and Shapira, 2022). Chambert et al. (2018) and Ruff et al. (2020) addressed this issue by selecting audio frames for post hoc review based on their model's predictions. We extended this approach to account for a highly imbalanced dataset. Evaluation of a highly imbalanced dataset through random selection alone would require unfeasibly large numbers of audio frames to be manually reviewed to ensure that sufficient target calls were captured to give reliable evaluation metrics (Raeder, Forman and Chawla, 2012). For example, where target calls make up 0.028% of the unlabelled data, which is the case for our unlabelled data, approximately 357,000 audio frames would require manual review to capture 100 target calls.
To overcome this challenge, we applied a random selection approach that was stratified across the model's predictions (logits). While this approach substantially reduced class imbalance within our evaluation data and allowed for the calculation of reliable evaluation metrics (Raeder, Forman and Chawla, 2012), the nature of this approach alters the distribution of data and prohibits evaluation metrics being generalised to the unlabelled data.
Additional research is needed to investigate more appropriate evaluation methods for highly imbalanced and unlabelled test data.
An active learning strategy requires consideration of biases that may be introduced during annotation (Monarch, 2021). While annotation bias is a consideration for all ML datasets, it is a particular risk where most labels are produced by a single annotator; under these conditions the model may simply replicate the biases of the training data (Roh, Heo and Whang, 2019; Koenecke et al., 2020). Performance improvements may therefore be achieved through the inclusion of labelled training data from additional annotators experienced with the species, or further investigation of the vocal repertoire of the target species compared to potential false positives. Broadly, our results highlight the importance of considering annotation biases, particularly for datasets created by a small group of annotators.
The use of multiple annotators to label training data requires consideration of inter-annotator agreement and how this affects ML model performance. Classifying wildlife vocalisations with certainty is unfeasible for many species. While ML models may theoretically be able to outperform inter-annotator agreement, the level of inter-annotator agreement remains a useful benchmark to assess model performance (Boguslav and Cohen, 2017;Richie, Grover and Tsui, 2022). In the field of natural language processing, where labels are often subjective, inter-annotator agreement metrics are a common tool to assess label quality and model performance (Pustejovsky and Stubbs, 2012). Our results support these arguments with the congruence among annotators providing a useful benchmark to assess model performance and identify potential areas of improvement in the training data.
The standard approach to building a bioacoustic classification model with ML often explicitly or implicitly assumes that labels are accurate, with limited scope for the inclusion of ambiguous vocalisations (Cabitza et al., 2020; Otani et al., 2020; Campagner et al., 2021). Our results show that exclusion of ambiguous vocalisations from the evaluation dataset significantly inflated the model's evaluation metrics. Our findings therefore agree with those of Cabitza et al. (2020) from the medical literature, that model performance is overestimated if inter-annotator agreement is not accounted for. While the effect size for our data was small, owing to good agreement among annotators, the overestimation of model performance is theoretically negatively correlated with inter-annotator agreement (Cabitza et al., 2020). We recommend that the development of models for species with ambiguous vocalisations account for inter-annotator agreement in their design. We demonstrated how this can be accomplished using multiple annotators and adoption of a classification with a 'reject' option approach (i.e. the 'uncertain' category), which allows the model results to be directly compared to those of the annotators (Bishop, 2006; Campagner, Cabitza and Ciucci, 2019). Other approaches applied within the natural language processing and medical imaging fields include the development of a 'gold-standard' set of labels through the use of multiple annotators or label cleaning (Pustejovsky and Stubbs, 2012; Karimi et al., 2020), or incorporating annotator uncertainty directly into model training.
Audiomoth recordings appear to have captured a wider range of SBTF calls than bioacoustic recorders placed at foraging and drinking sites, including numerous softer calls unlikely to be heard at distance (see Shephard, Pridham and Forshaw, 2012). Annotators would likely differ in their familiarity with these softer calls, resulting in lower inter-annotator agreement on Audiomoth recordings when compared with recordings from bioacoustic recorders (which are more likely to have captured the more commonly heard, louder and distinctive 'long'/'pew' call of SBTF).
Differences in the volume/amplitude and quality of recordings captured by bioacoustic recorders and Audiomoths may also have affected annotators' assessment of calls, with bioacoustic recorders having a signal-to-noise ratio (SNR) of 80 dB compared with 42 dB in Audiomoths (Roe et al., 2021; Open Acoustic Devices, n.d.). The higher SNR of bioacoustic recorders would afford greater sound fidelity and less background noise, thereby providing recordings that align more closely with the annotators' field experience (Turgeon, Van Wilgenburg and Drake, 2017; Darras et al., 2020).

Conclusions
Our study investigated the efficacy of active learning as a framework for building deep learning models in a setting where limited training data were available and the cost to obtain training data was high. Using field recordings of a threatened species, the southern black-throated finch, we successfully developed and demonstrated the value of an open access active learning method that considered imbalanced and unlabelled data and ambiguous vocalisations, which are common barriers to constructing call recognition models for rare or cryptic species, and other species that vocalise infrequently. Our results demonstrate the utility of these methods in developing effective call recognisers for rare and difficult to survey species.

Data availability
The dataset supporting this study is available at: 10.6084/m9.figshare.23053382.
Note: the above DOI will be hyperlinked to the final dataset. A draft dataset for peer review can be accessed through this link: https://figshare.com/s/b1377829938b276b17ea. The dataset will be finalised and published following peer review.