Automatic detection and taxonomic identification of dolphin vocalisations using convolutional neural networks for passive acoustic monitoring

A novel framework for acoustic detection and species identification is proposed to aid passive acoustic monitoring studies on the endangered Indian Ocean humpback dolphin (Sousa plumbea) in South African waters. Convolutional Neural Networks (CNNs) were used for both the detection and the identification of dolphin vocalisations, and performance was evaluated using custom and pre-trained architectures (transfer learning). In total, 723 min of acoustic data were annotated for the presence of whistles, burst pulses and echolocation clicks produced by Delphinus delphis (~45.6%), Tursiops aduncus (~39%), Sousa plumbea (~14.4%), and Orcinus orca (~1%). The best performing models for detecting dolphin presence and for species identification used two-second segments (spectral windows) and were trained using images at 70 and 90 dpi, respectively. The best detection model was built using a customised architecture and achieved an accuracy of 84.4% for all dolphin vocalisations on the test set, and 89.5% for vocalisations with a high signal-to-noise ratio. The best identification model was also built using the customised architecture and correctly identified S. plumbea (96.9%), T. aduncus (100%), and D. delphis (78%) encounters in the testing dataset. The framework was designed based on knowledge of complex dolphin sounds, and it may assist in finding suitable CNN hyper-parameters for other species or populations. Our study contributes towards the development of an open-source tool to assist long-term studies of endangered species living in highly diverse habitats, using passive acoustic monitoring.


Introduction
Accurate remote sensing tools used to investigate wildlife populations are critical for long-term monitoring and effective conservation actions, especially for endangered species. Passive acoustic monitoring (PAM) has been extensively used to investigate endangered dolphin populations (Dong et al., 2017; Jaramillo-Legorreta et al., 2017; Munger et al., 2016), and machine learning techniques have been employed to improve the accuracy and speed of acoustic detection (Bergler et al., 2022; Caruso et al., 2020; White et al., 2022; Ziegenhorn et al., 2022). Despite the widespread use of PAM, relatively few tools are available that detect and identify these sounds in archived recordings (Bergler et al., 2022; Gillespie et al., 2009; Sugai et al., 2019). Additionally, the lack of annotated sounds in openly available datasets precludes further development of complex machine learning models for dolphin sound detection and species identification, as a large amount of data is needed (Jordan and Mitchell, 2015). Conservation actions and population monitoring using PAM are thereby limited for some species, particularly those living in noisy habitats where sympatric species emit similar acoustic signals. Effective classifiers are required to identify species of interest in highly complex ecosystems (Ziegenhorn et al., 2022).
The development of effective tools using PAM techniques, designed for the monitoring and conservation of the endangered Indian Ocean humpback dolphin (Sousa plumbea) in South Africa, was the catalyst for this study. Toothed whales rely on acoustic communication for biological success, using a variety of functionally specific signals, such as tonal whistles or broadband pulse bursts in social interactions, as well as echolocation clicks for navigation and feeding. Humpback dolphins in South Africa inhabit shallow rocky and sandy-bottom shore zones of a very heterogeneous habitat along the southwesterly portion of this species' distribution (Best and Folkens, 2007), which could negatively affect the detectability of specific sounds due to the noisy environment (Shabangu et al., 2022). Additionally, the use of coastal habitats increases their interaction with human activities (Plön et al., 2015), such as boat traffic (Karczmarski et al., 1998), which not only contributes to the soundscape as noise (Schoeman et al., 2022), but can also mask the sounds produced by dolphins and interfere with both natural communication (Fouda et al., 2018; Jensen et al., 2009) and the monitoring of wild populations. Despite the potential significance of passive acoustics in monitoring humpback dolphins (Sousa spp.) (Bopardikar et al., 2018; Dong et al., 2017; Yang et al., 2020), its application to long-term recordings is still constrained in South African waters, as there are no available automated classifiers to differentiate their sounds from those of other dolphin species present in the area. Humpback dolphins from the southern Indian Ocean have an overlapping distribution with at least three other whistling dolphin species, most commonly the Indo-Pacific bottlenose dolphin (Tursiops aduncus), the common dolphin (Delphinus delphis), and the killer whale (Orcinus orca) (Peddemors, 1999).
The highly diverse vocal repertoire of delphinids (Odontoceti: Delphinidae) reflects the complexity of their cognitive abilities due to a strong social component (Fox et al., 2017). Despite this, their vocal production structures share similar morphological adaptations (Mead, 1975). However, slight variations in size (Jensen et al., 2018) and head shape of some species (e.g., S. plumbea) (Frainer et al., 2021; Song et al., 2022) may result in convergence on similar sound production capabilities with other species (e.g., T. aduncus and D. delphis) and potentially affect the accuracy of identification tasks (Yang et al., 2020). Humpback dolphins exhibit adaptations on the left side of their epicranial complex that may allow them to produce more directional and higher frequency communication sounds compared to bottlenose dolphins (Tursiops spp.) (Frainer et al., 2019). Such sounds, for example whistles, overlap in spectral frequency with those produced by T. aduncus and D. delphis (Erbs et al., 2017; Fearey et al., 2019; Gridley et al., 2014). Although most of the studies on this topic have investigated the differences across species using specific calls such as whistles (Erbs et al., 2017; Oswald et al., 2008; Oswald et al., 2021) or clicks (Buchanan et al., 2021; de Freitas et al., 2015; Luo et al., 2019; Temple et al., 2016; Yang et al., 2020), few studies have integrated multiple sound types as input for species classification tasks (Rankin et al., 2017).
In this study, we assessed the applicability of Convolutional Neural Networks (CNNs) for dolphin monitoring in long-term recordings using their complete vocal repertoire, along with a model prediction and post-processing approach for automated taxonomic identification. Although prior studies have shown the effectiveness of CNNs in detecting and identifying whale (Allen et al., 2021) and dolphin sounds (Buchanan et al., 2021; Duan et al., 2022; Erbs et al., 2023; Luo et al., 2019; Nur Korkmaz et al., 2023), a multi-class classifier that encompasses all the dolphin species occurring in South African waters has yet to be developed. The framework proposed here, which combines biological knowledge of sound production in dolphins with innovative machine learning tools, may enhance the use of PAM for target species in highly diverse areas. Improving remote sensing techniques to monitor the population dynamics of the endangered humpback dolphin in South Africa via their vocalisations (Longden et al., 2020; Wang et al., 2020) would be a critical stride towards the development of an automated, long-term monitoring system and effective conservation management strategies. This represents the first tool of this nature for South Africa and will be available to ecologists, management teams, and researchers.

Data collection
To build the training library, boat-based focal follow recordings were used to record the vocalisations of four whistling coastal dolphin species that inhabit South African waters (Fig. 1). The recordings were made using a SoundTrap 300HF (flat frequency response of 20 Hz to 150 kHz ± 3 dB; Ocean Instruments Inc., New Zealand), or HTI-96-MIN hydrophones (flat frequency response of 2 Hz to 30 kHz ± 1 dB; High Tech Inc., U.S.) attached to a TASCAM DR 680 recorder (TASCAM, U.S.) (Supporting Information A) and were stored as .wav files. Hydrophones were set approximately four metres deep during dolphin encounters, with signals digitised at a sample rate of 96 kHz or higher in continuous recordings. Dedicated visual surveys were performed during all boat-based recordings to ensure that no other species were present in the area, i.e., data from mixed-species groups were not included in the analysis. Additionally, recordings made through moored instruments in Mossel Bay were obtained between March 19th and April 4th of 2021, using a SoundTrap 300HF sampling at 96 kHz at five metres depth (Fig. 1). The presence of S. plumbea and T. aduncus in the vicinity of the devices was confirmed by land-based observations from the harbour wall, located approximately 100 m away from the mooring. The close proximity of the dolphins to the recorder, combined with the simultaneous capture of strong signals by the devices during visual observations, confirmed the correlation between sound and species identification. Two confirmed D. delphis encounters between the 15th and 16th of May 2021, in False Bay, were recorded using a SoundTrap 300HF hydrophone attached four metres deep to a free-drifting buoy. Furthermore, to validate the single-species encounters, visual observations were conducted from a boat positioned roughly 400 m away from the drifting buoy. A moored SoundTrap 300HF sampling at 576 kHz at ~10 m depth was deployed between the 31st of January and the 2nd of February 2021, in Fish Hoek, Cape Town, to record O. orca sounds during four days of a confirmed sighting in the area (i.e., reports from whale watching networks and personal observation) (Fig. 1). In this case, a male O. orca was sighted on consecutive days close to the moored hydrophone, through visual observations from a boat. The unique complex calls from O. orca (Miller and Bain, 2000) confirmed the species identification of the vocalisations. The moorings used in this study were attached to a rope that was suspended along the water column by a subsurface buoy. The moored hydrophones were attached approximately two metres from the bottom, and all the moored and free-drifting recordings were made continuously (Supporting Information A).

Training dataset and testing dataset
Dolphin whistles, burst pulses, and echolocation click trains were inspected aurally and visually, using spectrograms (FFT length = 1300; hop size = 650; Hann window; with smoothing applied), and manually annotated using Raven Pro 1.6 (Cornell Lab of Ornithology, 2023). The labelled dolphin vocalisations varied from short whistles and burst pulses to long segments with more than one vocally active animal, including big groups (>100 animals, e.g., D. delphis and T. aduncus) (Fig. 2). Soundscapes, comprising non-dolphin biological (e.g., fish, snapping shrimp, reef), geophonic (e.g., rough seas, rain), and anthropic sounds (e.g., chain noise, boats), were also manually annotated (Dufourq et al., 2021; Stowell et al., 2019) to represent the naturally occurring soundscape in the absence of dolphins. The start and end of each annotation were recorded, as well as the duration of each segment. For the testing dataset, vocalisations were categorised according to the amount of noise masking, interpolated from the signal-to-noise ratio graded from one (i.e., masked/weak signal) to three (i.e., strong and clear signal). The visually monitored recordings from moored hydrophones in Mossel Bay were used as 'unseen data' to test the generalisability of our tool. Similarly, D. delphis recordings from a free-drifting buoy, as well as O. orca sounds from the moored hydrophone, were only used to test the species identification model (Supporting Information A).

Pre-processing
To ensure consistent sampling rates, audio recordings with a sampling rate above 96 kHz were downsampled to 96 kHz. To create the training set, a sliding window approach was used to extract segments of sound with equal length (a user-defined hyper-parameter) from the annotated events (Dufourq et al., 2021), in which segments were sampled in series based on their start and end times. The segment start times were spaced one second apart to sample dolphin vocalisations in different contexts. We compared the accuracy of the models by varying the window size (2, 3, 5, and 7 s) to determine the best parameter. These window sizes range from the shortest segment possible (i.e., two seconds) to a segment long enough to cover the longest dolphin vocalisation (e.g., O. orca complex calls). All segments were augmented by randomly mixing dolphin sounds with soundscapes from the target area where the classifier would be applied; in our case, Mossel Bay. The new segments contained a proportion of both dolphin (90%) and soundscape (10%) sounds, to better represent how a species would be detected in the target area. The amount of augmentation for each species was scaled up relative to the number of segments generated for the species with the largest amount of data, which was only duplicated due to the large number of clips generated (i.e., D. delphis, with 20,319 clips generated and 40,638 spectrograms created). We also balanced each species dataset per encounter to ensure equal distribution for the sounds produced in different contexts (see Discussion section). The class distribution was also balanced after the augmentation process, based on the class with the smallest dataset.
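The sliding-window extraction and the soundscape mixing described above can be sketched as follows. This is a minimal illustration, not the authors' script (see Supporting Information B for that): the function names are hypothetical, events shorter than the window are simply skipped, and the 90%/10% proportion is interpreted here as linear amplitude weights, which is an assumption.

```python
def window_starts(event_start_s, event_end_s, window_s, step_s=1.0):
    """Start times of fixed-length windows that fit inside one annotated event.

    Assumption: windows must lie fully inside the annotation, so an event
    shorter than `window_s` yields no segments in this sketch.
    """
    starts = []
    i = 0
    while event_start_s + i * step_s + window_s <= event_end_s:
        starts.append(event_start_s + i * step_s)
        i += 1
    return starts


def mix_with_soundscape(dolphin, soundscape, dolphin_weight=0.9):
    """Blend a dolphin segment with a soundscape segment of equal length.

    Assumption: the 90%/10% proportion is applied as amplitude weights.
    """
    noise_weight = 1.0 - dolphin_weight
    return [dolphin_weight * d + noise_weight * s
            for d, s in zip(dolphin, soundscape)]


# A 5 s annotated event sampled with 2 s windows moved in 1 s steps.
print(window_starts(0.0, 5.0, 2.0))
```

With a one-second step, neighbouring windows overlap, so the same vocalisation appears at different positions within several training segments.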
To test the efficacy of our models, we created testing segments using the same sliding window approach. Namely, we used the same window size that was used in training, and thus multiple segments were created across the entire testing file by moving the window by one second in the moored recording. We converted each of these testing segments into spectrograms (FFT length = 1024; hop size = 128; Hann window), which were used as input for subsequent model prediction. All generated spectrogram images were created as 5 × 5 in. but varied in their dpi configuration, ranging from 200 × 200 (40 dpi) to 500 × 500 (100 dpi) pixels. The number of images used per class was constrained by our computational resources, and we used the maximum number of images possible in each case. We ran a number of experiments in which we varied the number of classes. The largest dataset built comprised 80,000 images when combining a three-second window size and 40 dpi for the customised architecture (see Convolutional neural networks section), and the smallest one comprised 3900 images combining a two-second window size and 90 dpi for the transfer learning approach (Table 1).
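The relationship between dpi, image size, and the underlying spectrogram can be checked with a short calculation. The sketch below is illustrative only: the function names are hypothetical, and the frame count assumes a plain short-time Fourier transform without edge padding, which may differ from the plotting routine used in the study.

```python
def image_side_px(dpi, side_in=5):
    """Pixel length of one side of a square image of `side_in` inches."""
    return int(round(dpi * side_in))


def stft_shape(window_s, sample_rate=96_000, n_fft=1024, hop=128):
    """(frequency bins, time frames) of an unpadded STFT of one segment."""
    n_samples = int(window_s * sample_rate)
    frames = 1 + (n_samples - n_fft) // hop
    bins = n_fft // 2 + 1
    return bins, frames


for dpi in (40, 70, 90, 100):
    print(dpi, "dpi ->", image_side_px(dpi), "px per side")

print("2 s segment ->", stft_shape(2), "(bins, frames)")
```

Under these assumptions, a two-second segment has more frequency bins and time frames than any of the image sizes can hold, so the dpi setting controls how strongly the spectrogram is downsampled before it reaches the network.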

Convolutional neural networks
Two CNN models were implemented to detect and identify dolphin sounds (Fig. 3). The first model (CNN1) was a binary classifier that was trained to detect the presence or absence of dolphin sounds. The second model (CNN2) was a multi-class classifier that was trained to differentiate between different species of dolphins. Two architectures were compared: a customised CNN and a pre-trained ResNet152V2 (He et al., 2016), an architecture that has demonstrated good performance in animal sound classification tasks (Dufourq et al., 2022). The customised models were composed of three convolutional layers (32 filters, kernel size of 4 × 4, ReLU activation). Each convolutional layer was followed by dropout (rate of 0.4) and a max pooling (kernel size of 4 × 4) layer. This was followed by a fully connected layer with 64 ReLU units, dropout (rate of 0.4), and a softmax function (two units in the case of CNN1, and three or four units in the case of CNN2 depending on the number of species). The models were trained for 50 epochs using the Adam optimizer (Kingma and Ba, 2014), with a learning rate of 0.001 and a batch size of 32. The most suitable architecture was chosen based on the best validation accuracy (proportion of all correct predictions) and precision (number of true positives divided by the sum of true positives and false positives) obtained during training. The model training and prediction procedures were executed on Microsoft Azure using instance NV12s v3 with 12 vCPUs and 112 GB RAM. The CNNs were implemented using TensorFlow (Abadi et al., 2016) and Python 3. The Ubuntu 20.04 operating system was used and obtained via the Ubuntu 20.04 Data Science Virtual Machine on Microsoft Azure. The algorithm scripts are available in Supporting Information B.
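To illustrate how the image resolution interacts with the customised architecture, the following sketch walks an input image through three convolution and max-pooling stages and counts trainable parameters. Several details are assumptions not stated in the text: unpadded ('valid') convolutions with stride 1, a pooling stride equal to the pooling size, and three-channel input images. The resulting counts are therefore illustrative and should not be read as the values reported in Table 1.

```python
def custom_cnn_summary(side_px, n_classes=2, channels=3, n_conv=3,
                       filters=32, kernel=4, pool=4, dense_units=64):
    """Return (final feature-map side, flattened size, trainable parameters).

    Assumptions: 'valid' convolutions with stride 1, pooling stride equal
    to the pooling size, and `channels` input channels. Dropout layers add
    no trainable parameters.
    """
    params = 0
    size, in_ch = side_px, channels
    for _ in range(n_conv):
        params += (kernel * kernel * in_ch + 1) * filters  # weights + biases
        size = size - kernel + 1                           # 4 x 4 convolution
        size = size // pool                                # 4 x 4 max pooling
        in_ch = filters
    flat = size * size * in_ch
    params += flat * dense_units + dense_units             # dense layer
    params += dense_units * n_classes + n_classes          # softmax layer
    return size, flat, params


# CNN1 at 70 dpi (350 px) and a three-class CNN2 at 90 dpi (450 px).
print(custom_cnn_summary(350, n_classes=2))
print(custom_cnn_summary(450, n_classes=3))
```

Because each stage divides the spatial size by roughly four, the network stays small even for the larger images, and most of the parameters sit in the dense layer that follows the last pooling step.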

Inference and post-processing
CNN1 was applied to the unseen data to obtain softmax values indicating the likelihood of dolphin vocalisations within each testing segment. A post-processing technique was devised to group segments that were predicted as present, that occurred within a 900 s timeframe of each other, and for which the model displayed a high degree of confidence (> 70%). The outcome of CNN1 determined the start and end times for each acoustic encounter (AE), which groups calls occurring within 15 min (900 s) of each other. The time threshold separating AEs was determined based on ad hoc experimentation and can be easily adjusted during the inference step. Each AE was then assessed using CNN2 to assign a single species identification to all detected segments containing dolphin vocalisations. The taxonomic identification for an AE was determined by first using CNN2 to predict the species for each detected segment within the AE, and then assigning the majority species to the entire AE. The number of detections and the proportion of detections per species, as well as the start and end times (based on the file names) and the duration of the AE, are given in the output (see output example in Supporting Information B).
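The grouping and majority-vote steps can be sketched as below. This is a simplified illustration rather than the published post-processing script: the function name and data layout are hypothetical, each detection is assumed to carry the CNN1 presence probability together with the CNN2 species label, and the gap is measured from the end of the previous kept segment.

```python
from collections import Counter


def group_encounters(detections, min_conf=0.7, max_gap_s=900, window_s=2):
    """Group confident detections into acoustic encounters (AEs).

    `detections` is a list of (segment_start_s, presence_prob, species).
    Assumption: a new AE starts when a kept segment begins more than
    `max_gap_s` after the end of the previous kept segment.
    """
    kept = sorted(d for d in detections if d[1] > min_conf)
    encounters = []
    for start, _prob, species in kept:
        if encounters and start - encounters[-1]["end"] <= max_gap_s:
            enc = encounters[-1]
        else:
            enc = {"start": start, "end": start, "votes": Counter()}
            encounters.append(enc)
        enc["end"] = start + window_s
        enc["votes"][species] += 1
    for enc in encounters:
        # Majority vote: the most frequent segment label names the AE.
        enc["species"] = enc["votes"].most_common(1)[0][0]
    return encounters


example = [
    (0, 0.90, "S. plumbea"),
    (10, 0.95, "S. plumbea"),
    (20, 0.80, "T. aduncus"),
    (500, 0.50, "T. aduncus"),   # below the confidence threshold, ignored
    (2000, 0.99, "T. aduncus"),  # more than 900 s later, starts a new AE
]
for enc in group_encounters(example):
    print(enc["start"], enc["end"], enc["species"], dict(enc["votes"]))
```

Keeping the per-species vote counts alongside the majority label mirrors the proportions reported in the output, which lets a user inspect encounters where the vote was close.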

Model evaluation
The testing dataset was analysed by one experienced observer (GF), whereby dolphin echolocation clicks, burst pulses, and whistles were also manually annotated using Raven Pro 1.6 (Cornell Lab of Ornithology, 2023). A confusion matrix was then generated to compare the AEs detected by the CNN1 models against the ground-truth data, based on the time of correct/incorrect assignment (see Fig. 4). In this way, each second of the 24 h testing dataset was categorised as True Negative (TN), True Positive (TP), False Negative (FN), or False Positive (FP). The evaluation was performed for all dolphin sounds and, secondly, for all dolphin sounds with an SNR higher than 1, which are considered useful for ecological studies (Gridley et al., 2015; Palmer et al., 2019). The models were assessed based on the accuracy, precision, sensitivity (recall), specificity, and F1 score:

Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision = TP / (TP + FP)
Sensitivity = TP / (TP + FN)
Specificity = TN / (TN + FP)
F1 = 2 × Precision × Sensitivity / (Precision + Sensitivity)

The performance of the species identification model (CNN2) was tested using moored or drifting recordings (see Data collection section) with verified species identification, and the accuracy for each species was reported.
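The per-second scoring of the detection model can be sketched as follows. The interval representation and function names are hypothetical, and times are assumed to be whole seconds with half-open intervals (start included, end excluded); the metric definitions are the standard ones for a binary confusion matrix.

```python
def count_seconds(truth, predicted, total_s):
    """Label every second as TP, TN, FP, or FN from two interval lists.

    `truth` and `predicted` are lists of (start_s, end_s) integer pairs.
    """
    def to_mask(intervals):
        mask = [False] * total_s
        for start, end in intervals:
            for s in range(max(0, start), min(total_s, end)):
                mask[s] = True
        return mask

    t, p = to_mask(truth), to_mask(predicted)
    tp = sum(a and b for a, b in zip(t, p))
    fp = sum((not a) and b for a, b in zip(t, p))
    fn = sum(a and (not b) for a, b in zip(t, p))
    tn = total_s - tp - fp - fn
    return tp, tn, fp, fn


def detection_metrics(tp, tn, fp, fn):
    """Accuracy, precision, sensitivity, specificity, and F1 score."""
    def ratio(num, den):
        return num / den if den else 0.0

    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "precision": ratio(tp, tp + fp),
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "f1": ratio(2 * tp, 2 * tp + fp + fn),
    }


counts = count_seconds(truth=[(10, 20)], predicted=[(15, 30)], total_s=100)
print(counts)
print(detection_metrics(*counts))
```

Because dolphin-free seconds dominate a 24 h recording, accuracy and specificity can remain high even when sensitivity is modest, which is why precision, sensitivity, and F1 are reported alongside accuracy.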
To create the dataset, audio segments were extracted and augmented using the Microsoft Azure instance E96ias v4 with 96 vCPUs and 672 GB RAM.We chose a high performance machine aiming to execute the algorithm with as much data as possible instead of sub-sampling the dataset.The software was implemented using various Python 3

Table 1
Dolphin detection models performance based on the comparison of the time assigned for the acoustic encounters and the ground-truth (n = 18 encounters, see Fig. 4 and Supporting Information B / "Post_-processing_Human_detector.ipynb" file). Each row represents a combination of model architecture and the configurations used to build the image dataset for the training step, such as window size and dpi. Spec, specificity; Sens, sensitivity; Prec, precision; Accu, accuracy. We also provide the total number of trainable network parameters. *Time to predict 10 min recording. (Virtanen et al., 2020).

Results
The training dataset was based on 43 boat-based encounters (D. delphis, n = 8 encounters; O. orca, n = 4 encounters; S. plumbea, n = 19 encounters; T. aduncus, n = 12 encounters) and soundscape recordings from moored hydrophones (Fig. 1). Annotated sounds used to create the training dataset totalled 723 min of audio data, distributed as D. delphis (45.6%), O. orca (0.96%), S. plumbea (14.38%), and T. aduncus (39%), as well as 772 min of soundscape. The training library size varied based on the computing limitations (Table 1). The testing dataset for the detection model comprised one full day (24 h) of recording and contained 18 AEs varying from less than one second to ~59 min. The testing dataset for the species identification model was based on 10 to 30 min of unseen data for each of the species studied here (Supplementary Information I). The varying length of the testing dataset was due to the number of vocalisations detected in the unseen data by CNN1, which is potentially affected by the setup of the hydrophone deployment (moored or drifting buoy) and the behavioural biology of each species (see Discussion section). Except for D. delphis, for which we mostly used 10 min of testing data due to the higher number of detections in those recordings, all other testing files listed in Supplementary Information A per species were used to evaluate the identification models. The best model weights for CNN1 (detection) and CNN2 (species identification) were obtained using two-second segments (windows) with images generated at 70 and 90 dpi, respectively (Fig. 5).
The customised CNN architecture achieved the highest accuracy for both models, outperforming the pre-trained ResNet152V2 model with faster predictions. The best CNN1 model exhibited an 84.4% accuracy (Precision = 87.6%, Sensitivity = 56.7%, Specificity = 96.4%) in defining AEs based on all dolphin sounds in the test set and 89.5% accuracy (Precision = 67.9%, Sensitivity = 76.3%, Specificity = 92.3%) for sounds with an SNR higher than 1. On the other hand, the best ResNet152V2 model (using 90 dpi and a two-second window) achieved 83.9% accuracy (Precision = 56.1%, Sensitivity = 37.2%, Specificity = 93.8%) in a similar condition (i.e., detecting sounds with SNR > 1). Increasing the dpi in the training images improved the model's precision, but decreased its sensitivity, resulting in lower accuracy (Table 1). The best CNN1 model showed lower precision than the one built using 90 dpi but higher sensitivity (or recall), and thus a higher F1 score (Table 1). Notably, exploratory ad hoc tests investigating the duration of the segments (i.e., window size) used to build the dataset and the resolution of the training images were crucial in determining the best detection model.
The species identification model (CNN2) only showed high accuracy when excluding one class (i.e., O. orca). The two best-performing models were obtained when using segments of two seconds and 90 dpi. Furthermore, these two models achieved the best testing results when trained on two (S. plumbea and T. aduncus) and three (S. plumbea, T. aduncus and D. delphis) classes (Fig. 6, Table 2). The only model showing >50% accuracy for O. orca sound classification was the one using the transfer learning approach, although it performed poorly when identifying S. plumbea sounds, with only 9% accuracy. The highest accuracy for S. plumbea sound identification in PAM was achieved using a four-class model (including O. orca), but this model performed poorly in distinguishing O. orca sounds from other species (Fig. 6). The comparison of two two-class models (S. plumbea x T. aduncus) with distinct training library sizes (8 k and 12 k) demonstrated higher accuracy for the one built using a smaller training dataset. Inference using transfer learning took nearly twice as long as with the custom CNN architectures (Table 1).

Discussion
The algorithm developed in this study assisted in finding optimal parameters to construct a suitable training dataset to be used as input to CNNs for classification tasks on complex dolphin sounds. We found that using shorter window sizes generated more accurate models for both tasks (Tables 1 and 2). With a constant dpi, we investigated the impact of window size on the classification of dolphin calls to determine if it was necessary to encompass the longest annotated call (e.g., O. orca whistle) as proposed in previous studies (Dufourq et al., 2021). The comparison of two two-class (i.e., S. plumbea x T. aduncus) models, both built using the customised architecture and 40 dpi but with different window sizes (3 s and 7 s, Tables 1 and 2), demonstrated better performance for models built using a smaller window size, specifically two or three seconds in length (Table 1). The best model was built using a two-second window length. A smaller window size yields a more nuanced representation of dolphin sounds, allowing for the detection of rapid frequency modulation patterns that may not be discernible in longer windows (see Fig. 2). Additionally, we demonstrated that fine-tuning the dpi parameter had a significant impact on both models' accuracy, as the optimal dpi differed between the best CNN1 (dpi = 70) and CNN2 (dpi = 90) models, and higher or lower dpi settings were not effective for both tasks. Furthermore, our results in Table 2 reveal that the differences in model accuracy, due to window size and dpi, may have accounted for variations in the number of detections considered for species identification in the different CNN2 models. Although presenting higher precision compared to the best CNN1 model described before, the model built using the customised architecture, a two-second window, and 90 dpi showed lower sensitivity, thus potentially depending on strong signals from dolphin vocalisations (SNR > 1) to be detected and then classified at the species level.
The best CNN2 model successfully identified S. plumbea, T. aduncus and D. delphis sounds in a three-class classification model in the unseen data (Table 2). However, it was unable to perform well when including O. orca, which, interestingly, produces distinct echolocation click train patterns and complex calls including biphonic whistles with multiple harmonics (Miller and Bain, 2000), which are quite distinguishable from other species with mostly single contour whistle repertoires (Erbs et al., 2017). The inefficiency of the four-class CNN2 model can likely be attributed to the small sample size for O. orca, representing only ~0.9% of all annotated dolphin sounds, which was potentially limited by a small diversity of calls and behavioural contexts (Oswald et al., 2008; Quick and Janik, 2008). Nevertheless, the three-class CNN2 model represents a significant advance in dolphin sound classification tasks for taxonomic identification, especially for S. plumbea monitoring in South African waters. It is worth stressing that O. orca is not as common as T. aduncus or D. delphis (Best et al., 2010; Melly et al., 2018). They also produce sounds that are visually distinguishable from those of the other dolphins investigated, allowing them to be manually checked in a post-hoc analysis of results. Future investigation may address transfer learning using the same optimal window size and dpi found for the best CNN2 model as this approach, in our case, performed better than any other model for O. orca sounds (Table 2).

Fig. 6. Comparison of confusion matrices for a four-class (left) and a three-class (right) species identification model applied to the testing, unseen dataset (Supplementary Information A). Both models were trained using a customised architecture and a two-second window size to extract the annotated sounds from boat-based recordings, and the resulting spectrograms (images) used to train the models maintained a resolution of 90 dpi.

Table 2
Species identification models performance. n, the number of segments in the testing file detected by the CNN1 model (see Material and Methods section) that was used to assign species identification. Each row represents a combination of model architecture, the configurations used to build the image dataset for the training step such as window size and dpi, and the classes (i.e., species) used to build the model. Differences in accuracy related to library size were evaluated between two two-class models (S.
The approach proposed in this study presents a promising framework for future assessments on dolphin detection and identification using PAM recordings, as the algorithm was based on the biology of dolphin sounds. The nature of vocal production varies considerably among dolphins as some species are more actively vocal than others, potentially driven by group size dynamics (Oswald et al., 2008; Quick and Janik, 2008) (Fig. 2), resulting in a different number of sound detections extracted from the training dataset for each species, despite a similar number of boat-based encounters (see Material and Methods section, Supporting Information A). We balanced the dataset to account for the imbalance of the total number of detections per species, to match the largest dataset for a class (i.e., D. delphis). Also, dolphin vocal production is dependent on its behavioural context (Quick and Janik, 2008), and thus we also balanced each species dataset per encounter to ensure equal weights for the sounds produced in different contexts. Indian Ocean humpback dolphins, for example, presented long periods of echolocation click trains while on other occasions only a few whistles (personal observation on the training dataset). This approach ensured a better representation of whistles in the dataset for this species.
The use of AEs to define a time period of dolphin detections not only assisted in species identification by handling potential false positives but also defined periods of dolphin activity near moored hydrophones that may be useful for future ecological studies. Here, we built a framework to test the efficiency of the detection model (i.e., CNN1) based on AEs (Fig. 4), as the identification model (i.e., CNN2) was dependent on the sounds captured within each AE. In other words, we assessed the taxonomic identification of dolphin sounds based on the proportion of classified segments for each species in a certain time period (i.e., AE). We used this approach as, for certain species, classification based on one call may not be recommended (Rankin et al., 2017) due to the time-frequency characteristics of vocalisations overlapping with other species in the area, thus contributing to decreased accuracy in classification models (Yang et al., 2020). Killer whales are known to be able to mimic other dolphin species (Musser et al., 2014) and other marine mammal species (Foote et al., 2006). As such, it is necessary to consider the context in which those sounds were produced, instead of identifying single clicks, burst pulses, or whistles. Although our algorithm does not identify mixed-species groups, it might assist future dedicated research on this complex task. One can still verify the proportion of classified detections for each AE that is given in the output, and even experiment with more conservative times between AEs, thus assigning species identification based on more individualised groups of vocalisations. In this context, it is important to emphasise that a drawback of employing CNNs in the manner our algorithm was designed is the limitation of identifying only one species per second. Consequently, the model is unable to distinguish between detections where two species are vocalising simultaneously. This topic needs to be investigated in further detail.
Our study showcases the exceptional performance of CNNs in accurately classifying complex biological patterns such as click trains across species. Specifically, the testing data for S. plumbea was composed of a few whistles and a long series of click trains, for which the model correctly assigned 96.9% of the detected dolphin sounds (n = 166, Table 2). In this way, most of the click detections were correctly assigned at the species level. It is worth noting that the sample rate used here (i.e., 96 kHz) did not capture all of the dolphin click energy, which can reach up to 150 kHz (Au, 2000). However, the decision to use a sample rate of 96 kHz was made as this is a widely used sampling frequency that captures the entire frequency range of most dolphin whistles (Au, 2000) while maximising the deployment time for moored hydrophones, compared to full bandwidth recordings.

Conclusion
This study aimed to develop a sound classifier to acoustically monitor the critically endangered humpback dolphin in South African waters. As this species coexists with three other whistling dolphin species in the study region (Findlay et al., 1992), a species identification model was deemed essential. Our findings are encouraging and can assist conservation efforts by providing a tool for ecologists and researchers. The algorithm holds significant promise as a tool to be further developed for monitoring and research on Indian Ocean humpback dolphin acoustics in long-term recordings. The spatiotemporal definition of AEs to investigate Indian Ocean humpback dolphins' activity may assist studies on habitat use (Caruso et al., 2020) and those using individually distinctive signature whistles (Deecke and Janik, 2006; Janik et al., 2013) as input to mark-recapture approaches for population dynamics studies (Longden et al., 2020). The proposed framework can be adapted to other similar tasks involving PAM and species identification, especially on cetaceans. The automated adjustment of main parameters such as sample rate, dpi, and window size enhances the adaptability of the application. The output of the application may define the time of dolphin activity near a moored hydrophone, with a customisable time period between AEs that can be tailored to other locations and studies. Dolphins mostly live in a fission-fusion society, so the AE definition (see Material and Methods section) can be adapted for other species to assist with social-network studies based on group composition within a time frame (Whitehead, 2008).
We demonstrated the power of CNNs on the taxonomic identification of dolphin sounds. The open-source application presented here advances research on the detection and identification of dolphin vocalisations in audio recordings and will be valuable for monitoring the endangered Indian Ocean humpback dolphin in South African waters. The effective performance of the algorithm provided here encourages future research on using customisable CNNs and algorithms for the identification of complex signals. The proposed framework was designed to easily fine-tune classification tasks of biological sounds and may increase the use of CNNs through a near-friendly, Linux operating system interface. Future research may address further improvement of the detectability of dolphin vocalisations, enhancing identification accuracy, and categorising these sounds to potentially assign specific behavioural activity to each AE. Moreover, further research should be conducted to reduce processing time and facilitate real-time monitoring, thereby expanding the potential applications of this algorithm. The use of a high-performing application for dolphin identification in low-cost devices with "low" sample rates (i.e., 96 kHz) could prove invaluable for PAM in low-income countries (Lamont et al., 2022), particularly for optimising battery and deployment time. The development of effective remote sensing tools to monitor endangered dolphin species with optimised sampling rates may help expand hydrophone networks and cover larger areas over longer periods. There are fewer than 500 humpback dolphins remaining in South African waters (Vermeulen et al., 2018) and the population is under severe threat from anthropogenic (Plön et al., 2015) and natural impacts (Frainer et al., 2022). The proposed framework could be further refined by incorporating a new class into CNN2 to identify potential threats to the Indian Ocean humpback dolphin such as boat traffic, while assisting population dynamics and habitat use studies on this endangered species. The long-term monitoring of this species using acoustics may ensure a replicable way to evaluate changes in population dynamics in historic sites of occurrence.

Fig. 1. Locations of the boat-based recordings of the common dolphin (Delphinus delphis), the killer whale (Orcinus orca), the Indo-Pacific bottlenose dolphin (Tursiops aduncus), and the Indian Ocean humpback dolphin (Sousa plumbea) used to build the training dataset. Recordings from moored (circles) and drifting buoy-attached (triangle) hydrophones used as the testing dataset are represented in red. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

Fig. 2. Examples of spectrograms showing calls of all four species studied here, built with distinct window sizes (two-, three-, five- and seven-second lengths). Sample rate 96 kHz (Nyquist frequency 48 kHz), Hann window size of 1024 samples, and a hop size of 128 samples (75% overlap).

Fig. 3. The general pipeline of the algorithm used to build (Training) and test (Testing) the models.

Fig. 4. Detection model evaluation based on acoustic encounters (AEs). The confusion matrix was built based on resultant AEs assigned by the model compared to manually annotated data (human detector/ground truth).

Fig. 5. Indian Ocean humpback dolphin (Sousa plumbea) vocalisations captured in a single two-second segment and converted to linear spectrogram images with different dots per inch (dpi). Sample rate 96 kHz (Nyquist frequency 48 kHz), Hann window size of 1024 samples, and a hop size of 128 samples (75% overlap).
plumbea x T. aduncus) using customised architecture, two-second window and 70 dpi. *Four ten-minute files were used for this testing. **Total dataset size of 12 k images. ***Total dataset size of 8 k images.