A Convolutional Neural Network and R-Shiny App for Automated Identification and Classification of Animal Sounds

The use of passive acoustic monitoring in wildlife ecology has increased dramatically in recent years as researchers take advantage of improvements in automated recording units and associated technologies. These technologies have allowed researchers to collect large quantities of acoustic data which must then be processed to extract meaningful information, e.g., target species detections. A persistent issue in acoustic monitoring is the challenge of processing these data efficiently to automate the detection of species of interest, and deep learning has emerged as a powerful approach to achieve this objective. Here we report on the development and use of a deep convolutional neural network for the automated detection of 14 forest-adapted species by classifying spectrogram images generated from short audio clips. The neural network improved on the performance of models previously developed for some of the target classes. It performed well for most species and at least satisfactorily for the others. To improve portability and usability by field biologists, we developed a graphical interface for the neural network that can be run through RStudio using the Shiny package, creating a highly portable solution for efficiently processing audio data close to the point of collection, with minimal delay, on consumer-grade computers.


Introduction
Artificial intelligence (AI) technologies are increasingly being applied to issues in ecological research and conservation. In the field of wildlife ecology, the use of AI in combination with recent advances in survey techniques has enabled researchers to collect data on species occurrences at much broader spatial and temporal scales than were previously possible. Passive monitoring methods such as camera traps and bioacoustics have greatly improved the capacity of researchers to survey for wildlife, but the resulting large datasets require substantial processing to extract useful information. The task of quickly and accurately locating signals of interest (e.g., target species detections) within large audio or photo datasets remains a persistent challenge.

Some of the target classes were included because the species in question were of ecological or management interest due to potential competitive or predatory interactions with other species. For example, Townsend's chipmunks are important prey for many raptors and mammalian predators. Other classes were added because previous verification efforts indicated that they were likely to produce false-positive detections for existing target classes. Some target species fulfilled more than one of these criteria: band-tailed pigeon is a managed game species (Sanders 2015) that was extremely common at our study sites and was a major source of false-positive detections for great horned owl in the Ruff et al. (2020) study, and Townsend's chipmunk calls are easily confused with northern saw-whet owl calls.

Methods
Training data

Our training dataset included 53,292 unique clips of vocalizations from 14 species (Table 1). For eight of these species (band-tailed pigeon, great horned owl, mountain quail, northern pygmy-owl, northern saw-whet owl, northern spotted owl, red-breasted sapsucker, and Townsend's chipmunk), the examples in the training set included only one highly stereotyped call type or sound. For another four species, namely common raven, pileated woodpecker, Steller's jay, and western screech-owl, the training set included multiple call types, but the call types for each species were lumped into one class because the component syllables were sufficiently similar that including closely related call types would likely not hinder identification. For the remaining two species, barred owl and Douglas' squirrel, we included two call types but incorporated each call type as a separate class. Like Ruff et al. (2020), we included a catch-all "Noise" class for any image without one of our target species, which we reasoned would improve CNN performance for our target classes.

Training images were generated from annotated records of calls found in data from a survey effort for northern spotted owls and barred owls (see Duchac et al. 2020). We drew these examples from across our study regions so that the CNN's internal representation of each class would be less influenced by systematic regional differences in vocalizations or background noise.

To augment the unique sound clips included in the training set, we generated multiple variant spectrograms with randomized offset and dynamic range, producing three to six distinct images for each unique call, using the same procedure detailed in Ruff et al. (2020). Spectrograms (Fig. 1) consisted of grayscale images in portable network graphic (PNG) format with a resolution of 500 x 129 pixels and a bit depth of eight. We generated these images using the spectrogram command in SoX (version 14.4, http://sox.sourceforge.net). After creating the images, we reviewed the training set to ensure that each image contained a visible signature of sounds corresponding to one of our target classes, but no other classes. We reserved a randomly selected 20% of these images for the validation set and used the other 80% of images for training the CNN. The training set included spectrograms used to train the Ruff et al. (2020) CNN, as well as many additional images for those original target species and the ten additional classes. Because many sounds that were previously considered "noise" corresponded to one of the new target classes, we reviewed images included in the previous training set again to remove any spectrograms that contained calls of multiple species. Following the generation of the variant spectrograms, the final training dataset comprised 173,964 images representing our 17 classes (Table 1).
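As a rough illustration of this augmentation step, the sketch below renders several grayscale variant spectrograms of one clip with SoX, which exposes the image size, monochrome, raw-image, and z-axis settings as command-line options. This is not our original script: the randomization ranges for the z-axis offset (-Z) and dynamic range (-z), the channel mixdown, and the file naming are assumptions for illustration.

```python
# Sketch of variant-spectrogram generation with SoX. Assumes the sox binary
# is on the PATH; the randomization ranges below are illustrative guesses,
# not the values used for the published training set.
import random
import subprocess

def make_variants(wav_path, out_stem, n_variants=3):
    """Render n_variants grayscale 500x129 PNG spectrograms of one clip,
    randomizing the z-axis offset (-Z) and dynamic range (-z) so that the
    same call appears with slightly different contrast in each image."""
    for i in range(n_variants):
        z_top = random.uniform(-10.0, 0.0)     # z-axis maximum in dBFS (assumed range)
        z_range = random.uniform(60.0, 100.0)  # dynamic range in dB (assumed range)
        subprocess.check_call([
            "sox", wav_path, "-n", "remix", "1",  # mix down to one channel (assumed)
            "spectrogram",
            "-x", "500", "-y", "129",  # image resolution used in this study
            "-m",                      # monochrome (grayscale)
            "-r",                      # raw image: no axes or legends
            "-Z", str(z_top), "-z", str(z_range),
            "-o", "{0}_v{1}.png".format(out_stem, i),
        ])

# Example: make_variants("nso_call_0001.wav", "nso_call_0001")
```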

Convolutional neural network training
We implemented the CNN model in Python (version 2.7, Python Software Foundation) using Keras (Chollet 2015), an open-source, machine learning-focused application programming interface to Google's TensorFlow software library (Abadi et al. 2015). The CNN contained six trainable layers, including four convolutional layers and two fully connected layers. The first convolutional layer contained 32 5x5 filters, the second layer contained 32 3x3 filters, and the third and fourth layers each contained 64 3x3 filters. Each convolutional layer had a stride length of one and used Rectified Linear Unit (ReLU) activation. Each convolutional layer was followed by 2x2 max pooling and 20% dropout. The first fully connected layer contained 256 units using ReLU activation and L2 regularization and was followed by 50% dropout. The second fully connected layer was the output layer, which contained 17 units with sigmoid activation. Under sigmoid activation the class scores are not constrained to sum to one, so the network can assign high scores to multiple classes when more than one species is audible in the acoustic landscape; this is a major difference between a model with softmax activation and one with sigmoid activation.

We trained the CNN for 100 epochs using a batch size of 128 images. We measured loss using the binary cross-entropy function and used the Adam optimization algorithm (Kingma and Ba 2015) with an initial learning rate of 0.001. To prevent overfitting we saved the model after epochs in which validation loss decreased. We also included a stepdown function to adjust the learning rate during training: if the validation loss did not decrease by at least 0.025 for five epochs, the learning rate was reduced by half. This was followed by a cooldown period of six epochs; hence, the learning rate could diminish at a maximum rate of once every ten epochs. We implemented the cooldown period based on the observation that improvements in model performance during training are stochastic, and it might therefore take several epochs to realize the potential benefit of a given learning rate. We trained the CNN using an IBM POWER8 high-performance computer running the IBM OpenPOWER Linux-based OS with two Nvidia Tesla P100-SXM2-16GB general-purpose graphics processing units.
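The architecture and training configuration described above translate almost directly into Keras. The sketch below is a reconstruction from the text rather than our original training script: the input shape and orientation, the L2 penalty weight, and the checkpoint filename are assumptions, and the stepdown function is mapped onto Keras's ReduceLROnPlateau callback.

```python
# Minimal sketch of the CNN described in the text, using the Keras API.
from tensorflow import keras
from tensorflow.keras import layers, regularizers

def build_model(input_shape=(129, 500, 1), n_classes=17):
    # Four conv layers (stride 1, ReLU), each followed by 2x2 max pooling
    # and 20% dropout, then a 256-unit dense layer with L2 regularization
    # and 50% dropout, then a 17-unit sigmoid output layer.
    model = keras.Sequential([
        layers.Conv2D(32, (5, 5), strides=1, activation="relu",
                      input_shape=input_shape),
        layers.MaxPooling2D((2, 2)),
        layers.Dropout(0.20),
        layers.Conv2D(32, (3, 3), strides=1, activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Dropout(0.20),
        layers.Conv2D(64, (3, 3), strides=1, activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Dropout(0.20),
        layers.Conv2D(64, (3, 3), strides=1, activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Dropout(0.20),
        layers.Flatten(),
        layers.Dense(256, activation="relu",
                     kernel_regularizer=regularizers.l2(1e-4)),  # weight assumed
        layers.Dropout(0.50),
        # Sigmoid output: per-class scores in [0, 1], not forced to sum to one
        layers.Dense(n_classes, activation="sigmoid"),
    ])
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=0.001),
                  loss="binary_crossentropy")
    return model

# Training settings from the text: 100 epochs, batch size 128, checkpoint on
# improved validation loss, and the learning-rate stepdown (halve after five
# epochs without a 0.025 improvement, then a six-epoch cooldown).
callbacks = [
    keras.callbacks.ModelCheckpoint("cnn_best.h5", monitor="val_loss",
                                    save_best_only=True),
    keras.callbacks.ReduceLROnPlateau(monitor="val_loss", factor=0.5,
                                      patience=5, cooldown=6, min_delta=0.025),
]
# model.fit(train_images, train_labels, epochs=100, batch_size=128,
#           validation_data=(val_images, val_labels), callbacks=callbacks)
```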

Model testing
To evaluate the performance of the new model in a way that would require less human validation and would not be biased by the verification procedure itself, we compiled an independent test set of 131,767 images for which the correct labels were known and which had not been part of the training or validation set (Table 1). We labeled these examples manually, identifying species by ear and by examination of the spectrogram. After assembling the test set we used the presented version of the CNN to classify the images, and we report performance metrics based on the class scores that it assigned to each image. Here we report the same metrics that were reported by Ruff et al. (2020).

The basic data processing pipeline entails the creation of spectrograms for each non-overlapping 12-second segment of audio in the dataset and then processing these images with the CNN to provide predictions of the class to which each image should be assigned. During data processing Ruff et al. (2020) used an intermediate step of segmenting the audio into 12-second clips and created spectrograms from those clips. Further testing (Z. Ruff, unpublished data) revealed that an equally effective approach was to generate spectrograms representing 12-second segments directly from the long-form audio, which resulted in far less disk space used. The set of clips representing all non-overlapping 12-second segments takes up as much disk space as the original audio, while the spectrograms representing the same audio use approximately 2.5% as much disk space. After generating the full set of spectrograms the program processes the images with the pre-trained CNN model and outputs an array of class scores for each one. This array, along with the filename of the image, is written to a text file in comma-separated value format.

For large-scale data processing we used one or more scripts written in Python version 2.7; these scripts carry out the basic steps of the data processing pipeline in sequence. To increase the overall speed of data processing we divided the tasks of spectrogram generation and image classification between separate scripts and used additional scripts to divide the original dataset into chunks of roughly equal size to be processed in parallel, which allowed us to take advantage of computers with many processing cores and a large amount of available memory. The basic processing pipeline described here is versatile, but the details of an optimized implementation depend heavily on available computer resources.
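A condensed sketch of this pipeline follows, under stated assumptions: the helper names, file paths, and pixel scaling are hypothetical, the trained model is assumed saved as in the training sketch above, and images are scored one at a time for clarity (batching them before prediction is faster in practice).

```python
# Sketch of the processing pipeline: render one spectrogram per
# non-overlapping 12-s segment directly from the long-form recording,
# score each image with the trained CNN, and write the class scores
# to a CSV file. Helper names and file paths are hypothetical.
import csv
import subprocess
import numpy as np
from tensorflow import keras

SEGMENT_S = 12

def render_segments(wav_path, total_seconds, out_stem):
    """Render a raw 500x129 grayscale spectrogram for each 12-s segment."""
    pngs = []
    for i in range(int(total_seconds) // SEGMENT_S):
        png = "{0}_{1:05d}.png".format(out_stem, i)
        subprocess.check_call([
            "sox", wav_path, "-n",
            "trim", str(i * SEGMENT_S), str(SEGMENT_S),  # i-th 12-s window
            "spectrogram", "-x", "500", "-y", "129", "-m", "-r",
            "-o", png,
        ])
        pngs.append(png)
    return pngs

def score_images(model, pngs, out_csv):
    """Write one CSV row per image: filename plus the vector of class scores."""
    with open(out_csv, "w") as f:
        writer = csv.writer(f)
        for png in pngs:
            img = keras.utils.load_img(png, color_mode="grayscale")
            x = np.expand_dims(keras.utils.img_to_array(img) / 255.0, axis=0)
            scores = model.predict(x, verbose=0)[0]  # 17 class scores
            writer.writerow([png] + ["{0:.4f}".format(s) for s in scores])

# model = keras.models.load_model("cnn_best.h5")
# pngs = render_segments("site01_20230401.wav", 3600, "site01_20230401")
# score_images(model, pngs, "site01_20230401_scores.csv")
```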

Verification of model output
We used the Ruff et al. (2020) process as guidance to verify output from the CNN, but we made several important modifications. We constructed a list of audio clips to be reviewed based on the class scores assigned to the corresponding images, as contained in the full output file generated through CNN processing. We first extracted all rows with class scores ≥0.25 for northern spotted owl; these were considered potential spotted owl detections first and foremost, regardless of the scores assigned to them for other classes. We then extracted rows with class scores ≥0.95 for any other non-Noise class, which we considered potential detections for the class with the highest score. We extracted these potential detections from the original audio as 12-second clips, which were sorted into directories by target class. Within the directory for a given target class we further divided potential detections by time, creating a folder for each week that an automated recording unit (ARU) was deployed. We reviewed these short audio clips and corresponding spectrograms to confirm or reject each potential detection.

Results

The learning rate stepdown function was invoked a total of 10 times (i.e., as often as was possible given the patience and cooldown periods that we specified), the last being at epoch 97, and the final learning rate was 9.77 x 10^-7, a roughly 1,000-fold reduction from the initial learning rate of 0.001. The full training run took approximately 12.5 hours.

Test set performance

Precision (Fig. 2), recall (Fig. 3), and F1 scores (Fig. 4) for the CNN's performance on the test set varied for the 16 non-Noise classes across a range of thresholds. Performance was generally stronger for owls than for the other species. Among the owls, precision was highest for northern saw-whet owl and northern pygmy-owl, although precision at higher thresholds exceeded 90% for all owl species except great horned owl, for which precision was noticeably lower at all thresholds (Fig. 2). Among other species, precision was highest for Townsend's chipmunk, band-tailed pigeon, Steller's jay, and common raven (Fig. 2). Precision was relatively low for the Douglas' squirrel chirp call, mountain quail, and red-breasted sapsucker (Fig. 2). Precision was lowest for the Douglas' squirrel rattle call and did not exceed 50% for this class even at the highest threshold of 0.99 (Fig. 2).

Among the owl species, recall was best for northern spotted owl, barred owl (both call types), and northern saw-whet owl, somewhat lower for northern pygmy-owl and great horned owl, and lowest for western screech-owl across most of the range of thresholds, although recall was well above 50% for most species even at thresholds of 0.9 or more (Fig. 3). Recall for the other species was less consistent: it was highest for pileated woodpecker and Townsend's chipmunk, moderate for band-tailed pigeon and common raven, lower for mountain quail and Steller's jay, and lowest for red-breasted sapsucker and Douglas' squirrel (Fig. 3). Most classes showed recall above 50% at thresholds >0.9.

The plots of F1 score versus threshold indicated that the CNN had a fairly balanced mix of precision and recall for most of the owl classes across a broad range of thresholds, as demonstrated by the flatness of the curves at moderate threshold values (Fig. 4). Great horned owls had markedly better F1 scores at higher thresholds, which may be attributable to this class's low precision across most of the range of thresholds.
F1 score peaked at low threshold values for several owl species, including northern pygmy-owl, northern saw-whet owl, and western screech-owl (Fig. 4). These species had high precision even at low threshold values, which then appeared to be offset by diminishing recall at higher thresholds. Similar patterns were visible for the non-owl avian species, although these covered a broader range of values. We observed the best F1 scores for pileated woodpecker, common raven, and band-tailed pigeon, depending on threshold (Fig. 4). Mammals showed a wide range of F1 scores; Townsend's chipmunk was comparable to the owls, while Douglas' squirrel showed low F1 for the chirp call and lower F1 for the rattle call (Fig. 4).

The ROC curves showed close to ideal performance for most owl species; good performance for several of the other bird species, including band-tailed pigeon and pileated woodpecker, as well as Townsend's chipmunk; and somewhat weaker performance for common raven, Steller's jay, and both Douglas' squirrel classes (Fig. 5). Area under the precision-recall curve was 0.602 for Steller's jay and 0.864 for Townsend's chipmunk (Fig. 6).
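The threshold sweeps summarized above can be reproduced directly from the CNN's raw output and the test-set labels using the standard definitions given in the figure captions. The sketch below is a minimal illustration, not our analysis code: the score and label arrays and the threshold grid are hypothetical.

```python
# Minimal sketch of the precision/recall/F1 threshold sweep behind Figs. 2-4.
# `scores` and `labels` are hypothetical 1-D arrays over the test images for
# a single class (CNN class score and 0/1 ground truth, respectively).
import numpy as np

def sweep(scores, labels, thresholds=None):
    if thresholds is None:
        thresholds = np.arange(0.05, 1.00, 0.05)  # illustrative grid
    rows = []
    for t in thresholds:
        pred = scores >= t
        tp = float(np.sum(pred & (labels == 1)))   # true positives
        fp = float(np.sum(pred & (labels == 0)))   # false positives
        fn = float(np.sum(~pred & (labels == 1)))  # false negatives
        precision = tp / (tp + fp) if (tp + fp) > 0 else float("nan")
        recall = tp / (tp + fn) if (tp + fn) > 0 else float("nan")
        f1 = (2 * precision * recall / (precision + recall)
              if (precision + recall) > 0 else float("nan"))
        rows.append((t, precision, recall, f1))
    return rows
```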

Discussion
Here we present an improvement in CNN performance, achieved by retraining with a larger set of images and by increasing the number of target classes, resulting in a demonstrably higher-performing CNN compared to the network previously reported in Ruff et al. (2020). Additionally, we packaged the CNN as a desktop application run through RStudio, making the benefits of this tool available to field biologists and practitioners in a portable and user-friendly interface that requires only free and widely available software. Our CNN had substantially better performance than the Ruff et al. (2020) version for six owl species, as demonstrated by area under the curve values for both the ROC and precision-recall curves for each species. To demonstrate the CNN's ability to distinguish multiple call types for a single species, we added the barred owl inspection call class, which performed strongly and was reliably distinguished from the eight-note call; this represents progress toward automatically extracting detailed biological information from passive acoustic data, well beyond simple species presence. We also incorporated eight additional species, for which performance was mixed but still strong enough to use CNN output as the foundation for ecological analyses for most target species. Performance for the non-owl classes was comparable to that observed for the original six owl classes as reported by Ruff et al. (2020), which suggests that performance for these classes may improve to a similar degree in future versions. Manual review of apparent detections has the side benefit of producing training data that can be fed back into the network to improve performance in an iterative fashion by periodically retraining the model.

The use of softmax activation in the output layer of the CNN used by Ruff et al. (2020) implied that class labels were exhaustive and mutually exclusive, i.e., that each image had exactly one correct label. This is not strictly true, because in natural systems it is not unusual for multiple species to vocalize simultaneously. Because the number of target species for the Ruff et al. (2020) CNN was relatively small, the inability to recognize multiple classes in the same image was unlikely to be a major limitation, since single-class images would likely outnumber multi-class images in most cases. However, lists of target species are likely to expand with future CNN developments, so images containing multiple target classes will become more common. As such, it will be increasingly important that these multi-class events be accurately captured in the model output as multi-class predictions. The use of sigmoid activation in the output layer of the reported CNN should allow for multi-label classification, in which a single image can receive high scores for multiple classes. In practice, we did not find that the CNN reliably assigned high scores to each appropriate class when multiple species were present in an image. This may be because our training set contained only singly labeled images. These results suggest that effort should be made to include images with multiple correct labels, perhaps even generating them artificially by combining multiple single-class images, to train the CNN to more reliably recognize multi-class images.
Alternatively, recent work has obtained strong multi-label performance using "pseudo-labeled" training data, in which each training example has one class labeled as present or absent and all other classes labeled as unknown, combined with a custom loss function that penalizes incorrect predictions only for the labeled class (Zhong et al. 2020).

We found that recall for non-owl birds and mammals was well below 100% even at very low thresholds (e.g. 0.05), suggesting that the CNN assigned a very low score to the correct class in a substantial number of these cases. This was also true for western screech-owl, though less dramatically so. A larger proportion of test images of the non-owl classes featured calls of multiple species and therefore had multiple correct labels. Because the CNN did not reliably assign high scores to every class that was present in multi-class images, the lower recall observed for non-owls may be an artifact of our relatively modest test set rather than a feature of the CNN itself.

Precision was noticeably weak for the Douglas' squirrel rattle call even at high thresholds. This may be attributable to both the character of the call itself and our specific processing pipeline. The rattle call consists of an extended sequence of rapidly repeated (~15 s^-1) chirps. The speed of the call combined with the time resolution of our spectrogram images (500 pixels representing 12 s of recording time) means that even in cases with a high signal-to-noise ratio, individual chirps may be separated by as little as one pixel, and this separation may effectively vanish when combined with echoes, scattering, and ambient sounds. This call is also highly variable in length, although this does not seem to have hindered classification for other variable-length calls such as those of Townsend's chipmunk or northern saw-whet owl. It is possible that our training set for this species simply did not establish a sufficiently distinctive pattern, given the resolution of the images, to enable the CNN to disregard similar sounds. Common sources of false positives for the Douglas' squirrel rattle call were wind, insects, and anuran calls. In spite of these issues, the performance of our CNN was broadly comparable to that of another recent CNN, which achieved precision ranging from 0.13 to 1.00 and recall ranging from 0.25 to 1.00.

Although training convolutional neural networks is computationally intensive and benefits from high-performance computer hardware, particularly powerful graphics processing units, the actual processing of audio data can be done at a reasonable speed on consumer-grade computers. Because the task of generating spectrograms can be parallelized to a substantial degree, it makes efficient use of multi-core processors, and the availability of inexpensive 8- and 12-core central processing units makes desktop processing increasingly attractive. Classifying the resulting images is not computationally demanding and can be done at reasonable speed without relying on powerful graphics processing units. Depending on the specific hardware configuration, processing speed may be limited either by the data connection or by the read-write speeds of the storage media. We have obtained the best processing speeds with data stored on internal solid-state drives with high-speed data connections; however, external hard drives connected by universal serial bus still offer satisfactory performance.
The desktop application also included functions for extracting rows representing potential detections from the raw results file and for extracting short clips corresponding to these rows for subsequent verification.

Moving beyond the ability to process audio on consumer-grade computers, the advantages of large-scale passive acoustic monitoring may only be fully realized when the raw data can be processed in close to real time (i.e., as they are collected) and salient results communicated quickly to biologists and managers. This may entail processing data in a distributed fashion using small, inexpensive system-on-chip processing devices coupled to the recording device, or the use of purpose-made recording devices with software for onboard processing, e.g. AudioMoth (Hill et al. 2017, Prince et al. 2019). Such distributed processing nodes could communicate potential detections to biologists remotely over mobile data networks, streamlining the process of retrieving data from the field and allowing for very rapid responses to emergent issues at field sites (e.g. the Rainforest Connection project; https://rfcx.org). However, engineering such an all-in-one solution to remain active for weeks at a time and to withstand the environmental conditions typical of many field sites is a non-trivial challenge. These developments will require multi-disciplinary collaborations between ecologists, computer scientists, and engineers. From a species conservation and management standpoint these advancements will be crucial to enhance our ability to monitor target species in close to real time. This is especially true for species such as northern spotted owls, which are rare and elusive and whose habitats are often subject to land management actions with economic and ecological implications.

Table 1. Target species and the characteristic sounds used to train the convolutional neural network and construct the test set. Each row denotes a separate class and corresponds to one node in the network's output layer. We had a number of unique audio clips representing each class, from which we generated spectrogram images that we used to train the network. Each class was a specific call type or group of vocalizations with similar syllables. We generated three to six variant spectrograms with slightly different parameters for each clip to increase the volume of training data. During processing, long audio clips are segmented into spectrograms, each representing 12 seconds of audio. For each image the trained network outputs a vector of 17 class scores, each between zero and one, representing the strength of the match between the image and each of the target classes. The test set (n = 131,767) comprised images generated from examples of the same call types that were not used to generate spectrograms for the training or validation set. Some images in the test set contained calls from more than one target class.

Fig. 2. Precision for each target class, calculated as True Positives / [True Positives + False Positives], considering only clips with class score exceeding the detection threshold for each target class. Precision represents the proportion of apparent "hits" that correspond to real instances of the class in question.

Fig. 4. F1 score for each target class, calculated as 2 x (Precision x Recall) / (Precision + Recall), with both precision and recall calculated at a specific detection threshold. F1 score is intended as a balance of precision and recall and is used to gauge overall model performance.
Figs. 5 and 6. Receiver operating characteristic (Fig. 5) and precision-recall (Fig. 6) curves, illustrating a given model's performance and the magnitude of the trade-off between these metrics across a range of detection thresholds.